What is Item Response Theory?

Some marketing researchers and data scientists are quite familiar with Item Response Theory (IRT), but I suspect they are in the minority, and many will never have heard of it. It is a fairly technical subject.

I came into contact with IRT through psychometrics and educational psychology, and it has held my interest over the years because of its relevance to survey research.

Many articles and books have been published on IRT. Some resources on IRT or related topics that I have found helpful are:

  • The Theory and Practice of Item Response Theory (de Ayala)
  • Item Response Theory and Modeling (Raykov and Marcoulides)
  • Handbook of Item Response Theory Modeling (Reise and Revicki)
  • Ordinal Item Response Theory: Mokken Scale Analysis (van Schuur)
  • Multidimensional Item Response Theory (Reckase)
  • Journal of Survey Statistics and Methodology (AAPOR and ASA)
  • Journal of Educational and Behavioral Statistics (ASA)
  • British Journal of Mathematical and Statistical Psychology (Wiley)

An extremely comprehensive source is Handbook of Item Response Theory (van der Linden et al.), a mammoth three-volume set. (I only have Volume Three, which is concerned with applications.)

Below are some excerpts from the van der Linden handbook that describe what IRT is and why it was developed. Any copy/paste and editing errors are mine.

“Item response theory (IRT) has its origins in pioneering work by Louis Thurstone in the 1920s, a handful of authors such as Lawley, Mosier, and Richardson in the 1940s, and more decisive work by Alan Birnbaum, Frederic Lord, and George Rasch in the 1950s and 1960s. The major breakthrough it presents is the solution to one of the fundamental flaws inherent in classical test theory—its systematic confounding of what we measure with the test items used to measure it.

Test administrations are observational studies, in which test takers receive a set of items and we observe their responses. The responses are the joint effects of both the properties of the items and abilities of the test takers. As in any other observational study, it would be a methodological error to attribute the effects to one of these underlying causal factors only.


Nevertheless, it seems as if we are forced to do so. If new items are field tested, the interest is exclusively in their properties, and any confounding with the abilities of the largely arbitrary selection of test takers used in the study would bias our inferences about them. Likewise, if examinees are tested, the interest is only in their abilities and we do not want their scores to be biased by the incidental properties of the items.

Classical test theory does create such biases. For instance, it treats the p-values of the items as their difficulty parameters, but these values equally depend on the abilities of the sample of test takers used in the field test. In spite of the terminology, the same holds for its item-discrimination parameters and definition of test reliability.

On the other hand, the number-correct scores typically used in classical test theory are scores equally indicative of the difficulty of the test as the abilities of test takers. In fact, the tradition of indexing such parameters and scores by the items or test takers only systematically hides this confounding.

IRT solves the problem by recognizing each response as the outcome of a distinct probability experiment that has to be modelled with separate parameters for the item and test taker effects. Consequently, its item parameters allow us to correct for item effects when we estimate the abilities.

Likewise, the presence of the ability parameters allows us to correct for their effects when estimating the item parameter. One of the best introductions to this change of the paradigm is Rasch, which is mandatory reading for anyone with an interest in the subject…and places the new paradigm in the wider context of the research tradition still found in the behavioral and social sciences with its persistent interest in vaguely defined ‘populations’ of subjects, who, except for some random noise, are treated as exchangeable, as well as its use of statistical techniques as correlation coefficients, analysis of variance, and hypothesis testing that assume ‘random sampling’ from them.

The developments since the original conceptualization of IRT have remained rapid…Not only have the original models for dichotomous responses been supplemented with numerous models for different response formats or response processes, it is now clear, for instance, that models for response times on test items require the same type of parameterization to account both for the item and test taker effects.

Another major development has been the recognition of the need of deeper parameterization due to a multilevel or hierarchical structure of the response data. This development has led to the possibility to introduce explanatory covariates, group structures with an impact on the item or ability parameters, mixtures of response processes, higher-level relationships between responses and response times, or special structures of the item domain, for instance, due to the use of rule-based item generation.

Meanwhile, it has also become clear how to embed IRT in the wider development of generalized latent variable modelling. And as a result of all these extensions and new insights, we are now keener in our choice of treating the model parameter as fixed or random…

Like any other type of probabilistic modelling, IRT heavily depends on the use of statistical tools for the treatment of its models and their applications.

Nevertheless, systematic introductions and review with an emphasis on their relevance to IRT are hardly found in the statistical literature…[FOR EXAMPLE] topics such as commonly used probability distributions in IRT, the issue of models with both intentional and nuisance parameters, the use of information criteria, methods for dealing with missing data, model identification issues, and several topics in parameter estimation and model fit and comparison.

It is especially in these last two areas that recent developments have been overwhelming. For instance…thanks to the computational success of Markov chain Monte Carlo methods, these approaches have now become standard, especially for the more complex models…”

Source: Handbook of Item Response Theory (van der Linden et al.).
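
To illustrate the confounding described in the excerpt, here is a minimal Python sketch. It is not taken from the Handbook. It assumes the simplest IRT model, the Rasch (one-parameter logistic) model, in which the probability of a correct response depends only on the difference between a test taker's ability and the item's difficulty. The function names, group means, sample sizes, and the difficulty value are all hypothetical choices made for this example.

```python
import math
import random


def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model.

    theta: ability of the test taker (assumed scale, for illustration)
    b:     difficulty of the item (same scale as theta)
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def classical_p_value(abilities, b, rng):
    """Classical 'p-value' of an item: the proportion of simulated
    test takers who answer it correctly."""
    correct = sum(rng.random() < rasch_prob(theta, b) for theta in abilities)
    return correct / len(abilities)


rng = random.Random(0)  # fixed seed so the simulation is repeatable

b = 0.0  # hypothetical item difficulty, identical for both groups

# Two hypothetical field-test samples that differ only in average ability.
weak_group = [rng.gauss(-1.0, 1.0) for _ in range(5000)]
strong_group = [rng.gauss(1.0, 1.0) for _ in range(5000)]

p_weak = classical_p_value(weak_group, b, rng)
p_strong = classical_p_value(strong_group, b, rng)

print(f"p-value in the weaker sample:   {p_weak:.2f}")
print(f"p-value in the stronger sample: {p_strong:.2f}")
```

The item is the same in both runs, yet its classical p-value is lower in the weaker sample and higher in the stronger one, so the p-value reflects the sample as much as the item. In the Rasch model the item's difficulty b is a separate parameter that stays fixed regardless of who takes the test, which is the separation of item and test taker effects the excerpt describes.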

