Brian D. Bontempo, PhD.

One common metric used to assess item quality is the point biserial correlation coefficient (rpb). The “pt bis,” as it is sometimes called, is the correlation between an item score (1/0) and the total score on a test. Positive values are desirable and indicate that the item is good at differentiating between high-ability and low-ability examinees.
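As a rough illustration of the definition above, the point biserial can be computed as an ordinary Pearson correlation in which one of the two variables happens to be scored 1/0. The sketch below assumes plain Python lists; the function names and the six-examinee data set are made up for this example and do not come from any particular testing package.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either variable never varies
    return cov / math.sqrt(var_x * var_y)


def point_biserial(item_scores, total_scores):
    """Uncorrected point biserial: item score (1/0) vs. total raw score."""
    return pearson(item_scores, total_scores)


# Hypothetical data: six examinees, one item, and their total raw scores.
item = [1, 1, 1, 0, 0, 0]
total = [9, 8, 7, 6, 4, 2]
r = point_biserial(item, total)
print(round(r, 3))
```

In this made-up data set the examinees who answered the item correctly also have the highest totals, so the value comes out strongly positive (about .84).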

As straightforward as the concept may seem, there are a handful of different flavors of the point biserial. The most basic flavor relates examinees’ item scores to their total raw scores on the test. A slightly more robust version of this, called the corrected point biserial, calculates the relationship between the item score and the test score after removing the item in question from the total score. This flavor is very important for short tests, where one item can dramatically impact the total score. Another flavor is the point-to-measure correlation, which correlates the item score with the IRT ability estimate rather than the raw score. The point-to-measure correlation can come in both corrected and uncorrected forms.
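To make the difference between the uncorrected and corrected flavors concrete, here is a minimal sketch. It assumes the responses are stored as a list of rows (one row per examinee, one 1/0 entry per item); that layout, the function names, and the tiny three-item data set are all hypothetical choices for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either variable never varies
    return cov / math.sqrt(var_x * var_y)


def uncorrected_point_biserial(responses, item_index):
    """Correlate the item with the total score, item included."""
    item = [row[item_index] for row in responses]
    total = [sum(row) for row in responses]
    return pearson(item, total)


def corrected_point_biserial(responses, item_index):
    """Correlate the item with the total score after removing that item."""
    item = [row[item_index] for row in responses]
    rest = [sum(row) - row[item_index] for row in responses]
    return pearson(item, rest)


# Hypothetical data: five examinees (rows) on a three-item test (columns).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
uncorrected = uncorrected_point_biserial(responses, 0)
corrected = corrected_point_biserial(responses, 0)
print(round(uncorrected, 3), round(corrected, 3))
```

On a test this short, the first item contributes a third of the total score, so the uncorrected value (about .72) is much higher than the corrected one (about .33). The gap would be expected to shrink as the number of items grows.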

Since test makers rely very heavily on the pt bis as a measure of item quality, it is important to remember that the point biserial is greatly affected by restriction of range in the total test score (or measure). That is, when the total scores do not vary greatly, the point biserial understates the actual relationship between the item and the ability being measured. This is most noticeable in difficult items, where the total scores of people who answered an item correctly tend to be quite similar (in fact, they are often close to perfect scores).
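The effect of restriction of range can be seen in a toy example. The sketch below computes the point biserial for the same item twice: once over a full group of examinees and once over only the higher scorers. The data and the cutoff of 6 are invented purely for illustration, so the specific values carry no meaning beyond showing the direction of the effect.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either variable never varies
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: ten examinees with total scores spread from 1 to 10.
totals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
item = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

# Full group: total scores vary widely.
r_full = pearson(item, totals)

# Restricted group: keep only examinees with a total of 6 or more
# (an arbitrary cutoff chosen for this illustration).
keep = [i for i, t in enumerate(totals) if t >= 6]
r_restricted = pearson([item[i] for i in keep], [totals[i] for i in keep])

print(round(r_full, 3), round(r_restricted, 3))
```

With these made-up numbers the point biserial falls from about .66 in the full group to about .35 in the restricted group, even though the item itself has not changed.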

Given that the size of a point biserial is greatly impacted by the variance in the scores of the examinees, it is not advisable to compare the point biserial of items across exams. A point biserial of .1 may be great for one test and really yucky for another. In addition, a point biserial of .1 for the hard items of a test may be great while it may be really yucky for the easy items of the same test.

We recommend that test makers use the point biserial as a way of identifying items that should be reviewed by subject matter experts for potential problems. We have found that items with low point biserials tend to have at least one of the following problems:

  • A debatable correct answer
  • More than one correct answer
  • No real correct answer
  • An unclear stem (the question text)
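A screening step along these lines might be sketched as follows. The code computes a corrected point biserial for every item and lists the ones that fall below a cutoff, so they can be handed to subject matter experts. The cutoff of .15, the function names, and the data are assumptions made for the example; as noted above, a sensible cutoff depends on the test, so it would need to be chosen for each exam rather than reused.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either variable never varies
    return cov / math.sqrt(var_x * var_y)


def corrected_point_biserial(responses, item_index):
    """Correlate the item with the total score after removing that item."""
    item = [row[item_index] for row in responses]
    rest = [sum(row) - row[item_index] for row in responses]
    return pearson(item, rest)


def items_to_review(responses, cutoff):
    """Return indexes of items whose corrected point biserial is below the
    cutoff, or undefined (for example, an item everyone answered correctly)."""
    flagged = []
    for index in range(len(responses[0])):
        r = corrected_point_biserial(responses, index)
        if math.isnan(r) or r < cutoff:
            flagged.append(index)
    return flagged


# Hypothetical data: six examinees (rows) on a four-item test (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
flagged = items_to_review(responses, cutoff=0.15)  # cutoff is an assumption
print(flagged)
```

Here the last item is flagged because the examinees who did well on the rest of the test tended to miss it, which gives it a negative corrected point biserial. A flag like this only marks an item for expert review; it does not say which of the problems listed above is present.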

By calculating and using the point biserial, test makers can identify problematic items. From there, they can collaborate with subject matter experts to improve the quality of their tests one item at a time. Look for future posts on other item quality metrics, such as item fit statistics.

B…