Why would a long test have a low reliability?
Introduction: Recently, we ran a reliability analysis for a client that is worth sharing. The certification exam had about 400 items, and the reliability came in under .80 as measured by Cronbach's Alpha (Cronbach, 1951), an undesirable, if not unacceptable, reliability for any test, let alone one containing 400 items. (NOTE: Mountain Measurement didn't design or build this exam; we simply analyzed the data as a first step toward helping the client.)
Background: Reliability is a psychometric indicator of test quality, and there is a direct relationship between test length (number of items) and reliability. Many 50-60 item tests achieve a reliability that exceeds .80. Given that this exam contained hundreds of items, we set out to determine the cause of the low reliability.
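The length-reliability relationship described above is conventionally quantified with the Spearman-Brown prophecy formula (the post doesn't name it, but it is the standard result). A minimal sketch, assuming the added items are parallel to, i.e. as good as, the existing ones:

```python
def spearman_brown(rho: float, k: float) -> float:
    """Predicted reliability when test length is multiplied by factor k,
    assuming the new items are parallel to the existing ones."""
    return k * rho / (1 + (k - 1) * rho)

# A typical 50-item test with reliability .80, lengthened 8x to 400 items:
print(round(spearman_brown(0.80, 8), 3))  # → 0.97
```

This is exactly why a sub-.80 reliability on a 400-item exam is alarming: parallel lengthening of an ordinary 50-item, .80-reliability test to 400 items would predict a reliability near .97.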
Possible Cause #1- Low Item Quality: We calculated the correlation between each item and the total test score and discovered that the vast majority of items had positive point-biserial correlation coefficients, indicating that the items themselves were of acceptable quality.
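For readers wanting to reproduce this check: a point-biserial here is simply the Pearson correlation between a 0/1-scored item and the total (or, better, the rest) score. A sketch of that calculation, assuming responses are already scored dichotomously (the post does not describe the actual analysis tooling):

```python
import numpy as np

def point_biserials(responses):
    """Corrected item-total (rest-score) point-biserial for each item.

    responses: examinees x items array of 0/1 scores.
    Returns NaN for zero-variance items (everyone right or everyone wrong),
    since a correlation is undefined there.
    """
    X = np.asarray(responses, dtype=float)
    total = X.sum(axis=1)
    out = []
    for j in range(X.shape[1]):
        rest = total - X[:, j]  # exclude the item from its own criterion
        if X[:, j].std() == 0 or rest.std() == 0:
            out.append(np.nan)
        else:
            out.append(np.corrcoef(X[:, j], rest)[0, 1])
    return np.array(out)
```

A positive value means higher-ability examinees tend to answer the item correctly; near-zero or negative values flag items worth reviewing.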
Possible Cause #2- Misalignment between the difficulty of the test items (questions) and the candidates taking the exam: We calculated the distribution of test scores, and here we found some problems. The mean raw score was quite high (about 325 out of 400) and the standard deviation was quite small (around 15). The scores were so high that no one scored below 200.
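A mean of about 325 out of 400 implies an average item p-value (proportion correct) near .81, with many items far easier still. One way to screen for such off-target items is to flag extreme p-values; the cutoffs below are purely illustrative, not ones stated in the post:

```python
import numpy as np

def flag_offtarget_items(responses, lo=0.2, hi=0.9):
    """Classical item difficulty (p-value) per item, plus a flag for items
    outside an informative difficulty range. lo/hi are illustrative cutoffs."""
    p = np.asarray(responses, dtype=float).mean(axis=0)
    return p, (p < lo) | (p > hi)

# Example: four items answered by five examinees
resp = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0]]
p, flagged = flag_offtarget_items(resp)
print(p)        # item 0 has p = 1.0, item 3 has p = 0.0
print(flagged)  # both extremes get flagged
```

An item that nearly everyone answers correctly (or that nearly everyone misses) has almost no response variance and therefore cannot separate examinees.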
When designing a test, the items should be targeted to the examinees; this maximizes the efficiency of the test. For our client, the 200 easiest items on this exam were adding absolutely nothing to the assessment. They were simply wasting the examinees' time. Moreover, based on the standard deviation, only about 60 items were providing measurement information; all the others were too easy or too hard to provide much information at all. In essence, this was a 60-item test, which explains why the reliability was only around .80.
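The claim that the non-informative items contribute essentially nothing can be checked directly with Cronbach's alpha: an item that everyone answers correctly has zero variance and zero covariance with everything else, so adding hundreds of them barely moves alpha. A small simulation sketch (the ability model and parameters below are invented for illustration, not the client's data):

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for an examinees x items matrix of 0/1 scores."""
    X = np.asarray(responses, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
n = 2000
theta = rng.normal(size=n)  # latent ability
# 60 moderately discriminating, well-targeted items
informative = (theta[:, None] + rng.normal(scale=3.0, size=(n, 60))) > 0
# 340 items so easy that every examinee answers them correctly
trivial = np.ones((n, 340))
full_test = np.hstack([informative, trivial])

print(round(cronbach_alpha(informative), 3))
print(round(cronbach_alpha(full_test), 3))   # barely differs from the 60-item value
```

The 400-item "test" and its 60 informative items yield nearly identical alphas, which is the post's point: by any reasonable accounting, this was a 60-item test.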
Recommendation: In order to raise the reliability of this exam, we advised the client to write more items similar in difficulty to the 60 items that, based on their empirical performance, were best targeted to the examinees.
April 23rd, 2010 at 12:32 pm
That’s incredible! Four hundred items with only 60 discriminating in the ability distribution of interest. You are absolutely right, a waste of time and money. The time cost is easy to see: if one item takes 30 seconds to answer (a very conservative estimate), then 340 unnecessary items mean almost 3 hours wasted per person, 1/8 of a day! Multiply that by the number of people taking the exam in a year, and all those person-hours are gone for no defensible psychometric reason. For example, if they tested 1,000 people a year (and if they have the funds to hire a consultant to help them with that exam, that is probably too low a number), that is about 125 person-days wasted; that is over a third of a year!
In terms of money, the sad thing is that this certification exam is probably not free. Assume the exam is delivered in a testing center, or even by hiring a proctor and renting hotel conference space. If the contract between the test center and the test sponsor or certifying agency is negotiated at an hourly rate, guess how much money is lost? As a very conservative estimate, say the test center charges $10 per testing hour per person; I personally know of rates that far exceed this value, but it is conservative enough to demonstrate the point. Going back to our very conservative 1,000 examinees per year, who together waste about 3,000 person-hours, roughly $30,000 is gone. And that’s with only 1,000 examinees per year. Please don’t tell me the actual number of examinees per year; I think I might cry. You get my point though!
Once again, I’ve enjoyed your post. Keep up the good work!
Regards,
DB
March 28th, 2012 at 5:14 am
Hi Brian,
I know it’s been a while since you posted this piece, but I’ve come across a similar problem myself and was wondering whether you could help me.
I’m looking at a 330-item test which had a KR-20 of 0.722 – pretty low. The point-biserials are pretty low across the board (an issue throughout the question bank), but there are also a lot of questions with high p-values, and the range of total scores is quite narrow.
I was wondering how you came to the conclusion that a question adds nothing to the assessment – is this based on the p-values? Also, how are you applying the standard deviation to the item data (p-values again?) to decide whether questions provide ‘measurement information’?
Thanks,
Neil