The Point-Biserial Correlation Coefficient
One common metric used to assess item quality is the point biserial correlation coefficient (rpb). The “pt bis,” as it is sometimes called, is the correlation between an item score (1/0) and the total score on a test. Positive values are desirable and indicate that the item is good at differentiating between high-ability and low-ability examinees.
As straightforward as the concept may seem, there are a handful of different flavors of the point biserial. The most basic flavor relates the examinees’ item scores to their total RAW scores on the test. A slightly more robust version of this, called the corrected point biserial, calculates the relationship between the item score and the test score after removing the item in question from the total score. This flavor is very important for short tests, where one item can dramatically impact the total score. Another flavor is the point-to-measure correlation, which correlates the item score with the IRT ability estimate rather than the raw score. The point-to-measure correlation can come in both corrected and uncorrected forms.
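To make the first two flavors concrete, here is a minimal sketch in plain Python. The response matrix is made up for illustration, and the function names (`pearson`, `point_biserials`) are hypothetical helpers rather than anything from a particular testing package. The point biserial is just a Pearson correlation in which one variable happens to be scored 0/1.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; the point biserial is this applied to a 0/1 item."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def point_biserials(responses, item):
    """Return (uncorrected, corrected) point biserials for one item.

    responses: list of examinee rows, each a list of 0/1 item scores.
    """
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    # Corrected version: total score with the item in question removed.
    rest = [t - s for t, s in zip(totals, item_scores)]
    return pearson(item_scores, totals), pearson(item_scores, rest)

# Hypothetical 6-examinee, 4-item response matrix (made-up data).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
uncorrected, corrected = point_biserials(responses, item=0)
print(f"uncorrected: {uncorrected:.2f}, corrected: {corrected:.2f}")
```

On a four-item test like this one, the item makes up a quarter of the total score, so the uncorrected value is noticeably inflated relative to the corrected one.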
Since test makers rely very heavily on the pt bis as a measure of item quality, it is important to remember that the point biserial is greatly affected by restriction of range in the total test score (or measure). That is, when the total scores do not vary greatly, the point biserials understate the actual relationship. This is most noticeable in difficult items, where the total scores of people who answered an item correctly tend to be quite similar (in fact, they are often close to perfect scores).
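A small sketch of the restriction-of-range effect, again in plain Python with made-up data: the same item is correlated with total scores first for the whole group, then only for the examinees in the top half of the score range. The cutoff of 7 and the data are assumptions chosen purely for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up data: 12 examinees with total scores 1..12 and one 0/1 item.
totals = list(range(1, 13))
item = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

r_full = pearson(item, totals)

# Restrict the range: keep only examinees with a total of 7 or more.
kept = [(i, t) for i, t in zip(item, totals) if t >= 7]
r_restricted = pearson([i for i, _ in kept], [t for _, t in kept])

print(f"full range: {r_full:.2f}, restricted range: {r_restricted:.2f}")
```

With these numbers the point biserial drops from roughly .68 over the full group to roughly .13 in the restricted group, even though the item itself has not changed.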
Given that the size of a point biserial is greatly impacted by the variance in the scores of the examinees, it is not advisable to compare the point biserial of items across exams. A point biserial of .1 may be great for one test and really yucky for another. In addition, a point biserial of .1 for the hard items of a test may be great while it may be really yucky for the easy items of the same test.
We recommend that test makers use the point biserial as a way of identifying items that should be reviewed by subject matter experts for potential problems. We have found that items with low point biserials tend to have at least one of the following problems:
- A debatable correct answer
- More than one correct answer
- No real correct answer
- An unclear stem (the question text)
By calculating and using the point biserial, test makers can identify problematic items. From there they can collaborate with subject matter experts to improve the quality of their tests one item at a time. Look for future posts on other item quality metrics such as item fit statistics.
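As a rough sketch of that screening step, the snippet below flags items whose corrected point biserial falls below a cutoff so they can be sent to subject matter experts. The cutoff of .15 is only a placeholder, not a recommended value, since a sensible threshold depends on the test; the function names and the response matrix are likewise hypothetical.

```python
from math import sqrt

def corrected_pt_bis(responses, item):
    """Correlation between a 0/1 item and the total of the remaining items.

    Returns None when either variable has no variance (correlation undefined).
    """
    x = [row[item] for row in responses]
    y = [sum(row) - row[item] for row in responses]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)

def items_to_review(responses, threshold=0.15):
    """Indices of items with an undefined or below-threshold point biserial."""
    flagged = []
    for item in range(len(responses[0])):
        r = corrected_pt_bis(responses, item)
        if r is None or r < threshold:
            flagged.append(item)
    return flagged

# Made-up responses: the last item is answered correctly mostly by low scorers.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
print(items_to_review(responses))
```

The flag is only a prompt for review: it says nothing about which of the problems listed above, if any, the item actually has.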
B…
June 30th, 2009 at 1:12 pm
This was a great post, Bri. It’s funny that this entry mirrored one of my last class lectures when I was reviewing students’ items for the tests (it was my assessment class– what a hoot).
That said, I wonder how often point biserials are used in the state tests on this sorry side of the continent. While there are so many problems with the tests the schools are using, too many of the test items are poorly constructed.
July 1st, 2009 at 8:50 am
Agree, great post. In the guidelines for publication, the Journal of Applied Measurement makes a case that Point Biserials (without mentioning the nuances you outline) must be in the .70 range to be considered useful (and not less than .30’s). This is a different guideline than your heuristic above. Reaction?
July 1st, 2009 at 11:38 am
In response to Matt: I’ve never ever witnessed a point-biserial over .50. It’s almost mathematically impossible to have one of those considering the restriction of range issues with a dichotomy (the item’s score). In reviewing the JAM article, it appears that they are discussing the values associated with rating scale items, not dichotomous items. This presents a very different situation, since an ordinal variable (the rating scale item score) is correlated with a continuous variable (the total test score). In this case, a Spearman correlation coefficient could be calculated but not a point-biserial correlation coefficient. I would expect the range of values for a Spearman correlation to be much higher than for a point-biserial, and this is directly related to the greater range in the values of a polytomy (0,1,2,3…X) than of a dichotomy (0,1).
This also raises another excellent point. In some circles, the point-biserial is called the Item-Total Correlation, which could be more completely called the Item Score-Total Score Correlation. I like this term better than the point-biserial correlation since it conveys what is being correlated rather than the statistic that is being used. When an item is a dichotomy, the item-total correlation would be calculated using the point-biserial. When an item is a polytomy, it would be calculated using the Spearman.
July 10th, 2009 at 9:54 am
Here’s a thought: Would it be possible to correct for restriction of range in the total test score? This would not have to be exact, as the p-biserial correlation is used mostly as an index of suspicion. It is always best to then look at the responses and the item itself.
August 21st, 2009 at 9:48 am
Hello,
You guys are much more sophisticated than I, so please forgive my naïveté.
I have used the p-biserial to investigate the correlation between a continuous variable and a dichotomous outcome. I’m using this as a very preliminary method. However, based upon your discussion about its use in test/item analysis, I wonder if I should even use it for that.
Any thoughts? Thanks.
September 2nd, 2009 at 4:34 pm
Yes, by all means the point biserial is a very useful statistic in helping you to determine the quality of a test item. There are also many other statistics such as fit statistics and item response time statistics that will help you out. We’ll post more info on these at some point in the future.
December 22nd, 2009 at 7:06 am
so do you treat the point-to-measure version the same as the “regular” version? — that is, if my rule is “review anything under .15”, say, would I apply the same rule? Also, I would like to know how the ptme version compares to the IRT parameter — is it “a”, I forget now? thanks. Oh, and I love your blog! 🙂
February 1st, 2010 at 11:16 pm
Great questions Deb! I have found that the actual values of the ptme are typically higher than the traditional pt bis. If you are interested in converting an item selection policy from one stat to the other, I would conduct a research study to determine the appropriate threshold values. We’ve done this before and would be happy to help out.
May 10th, 2010 at 6:39 pm
Dr Brian
I like your MMblog for the fine contents.
I have one query about an interesting finding we got.
We categorized 1115 A-type MCQs into three groups depending on their difficulty level (high, medium and low). The effect of item difficulty was analyzed vis-a-vis various discrimination indices (discrimination index, DI; point biserial correlation, RPB; corrected RPB; biserial correlation, RBIS; and corrected RBIS). The correlation was determined between item difficulty and the discrimination indices.
The item difficulty significantly influenced all discrimination indices (p < 0.001). The items with higher difficulty had a significantly lower degree of discriminating power. The item difficulty-discrimination index plot showed an inverted U-shaped relationship. The RPB showed a significant polynomial relationship with item difficulty (p < 0.001). However, the RBIS showed a significant linear relationship with item difficulty (p < 0.001).
My conclusion is that items in the low difficulty range (more than 0.7) are best at differentiating the high and low achievers, as revealed by the item-total correlation indices.
Any reaction?
Also, please tell me what could be an explanation (this is contrary to the belief that difficult items are good for discrimination).
—
Dr. KK Deepak MD, PhD, MNAMS
Professor of Physiology,
Medical Education Unit,
Item Bank Administrator ( IDEAL Consortium ) College of Medicine,
University of Dammam, Dammam
Kingdom of Saudi Arabia
( on assignment till Sept 2010 )
May 10th, 2010 at 8:04 pm
This is an excellent question! One interesting artifact of the point biserial correlation coefficient (item-total correlation) is that easier items tend to have higher values than more difficult items. This tends to be true for all MCQ exams. This is not because easy items are better items. It’s simply an artifact of the statistic with MCQ items. Why is this so? I don’t have a definitive answer, but I think that it has something to do with guessing. It’s more likely that lower performing students will guess an item correctly than that higher performing students will miss an item due to carelessness. Nonetheless, the more extreme a p-value (due to the item being either very easy or very hard), the smaller the point-biserial may be due to restriction of range issues. Really, the point-biserial works best when the p-value of a non-MCQ item is approximately 0.5.
June 7th, 2010 at 11:00 pm
I have to agree with you, Brian. Understanding the fundamentals of the correlation formula puts this into perspective. The numerator of various correlation formulas is covariance. The denominator is the product of the standard deviations of the item and the test (that is, the square root of the product of their variances). In order to have covariance, it is necessary to have variance both for the dichotomous item and the test. For the dichotomous item, variance is p*q, where p is the proportion correct and q is 1 - p. Thus, variance is maximized at .25 when p = .5, i.e., variance = .5 * .5. As the item deviates from p = .5 and q = .5, variance and covariance decrease, which decreases the magnitude of the correlation in general. When variance is maximized, covariance is maximized, which explains the inverted U pattern between difficulty and discrimination described by Dr. Deepak. This is the general behavior. However, more difficult items can be more discriminating than easy items (e.g., speeded tests).
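The p*q point is easy to check numerically. The short Python sketch below simply tabulates the variance of a dichotomous item at a few illustrative p-values (the particular values are arbitrary choices, not taken from any data set).

```python
def item_variance(p):
    """Variance of a dichotomous (0/1) item with proportion correct p."""
    return p * (1 - p)

p_values = [0.1, 0.3, 0.5, 0.7, 0.9]
variances = {p: item_variance(p) for p in p_values}
for p, v in variances.items():
    print(f"p = {p:.1f}  variance = {v:.2f}")

# The p-value at which item variance is largest.
most_variable = max(variances, key=variances.get)
```

The variance peaks at .25 when p = .5 and shrinks symmetrically toward very easy and very hard items, which is the shape of the inverted U.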
Patterns of relations can also change when test content is confounded with difficulty [e.g., (A) solving a one variable algebraic expression early in the test versus (B) reading and understanding a word problem, formulating a one variable mathematical expression from text and then solving the one variable algebraic expression later in the test]. Difficulty is confounded with content. Moreover, one can easily argue that the math test is multidimensional, involving verbal ability, analytical reasoning, and mathematical ability. In this case, the latter item can be broken down into three factors that combine into a linear composite when one score is reported. The early question only loads on one component in the composite and results in a lower point biserial. To the extent that verbal, analytical, and math are positively related in the latter question, then a higher point biserial correlation results because there are three positively related components producing the higher correlation. I hope this helps.
May 16th, 2012 at 10:31 am
Brian,
I have a question about point biserial correlation. I want to measure the correlation between gender and frequency of attending church. I set male=1 and female=2 in SPSS and did the correlation analysis. The output showed a negative value for the coefficient. Should I interpret the result as females having a lower frequency and males having a higher frequency? Also, as the percentages of males and females are 28% and 72% respectively, will this uneven distribution affect the calculation of the coefficient?