What do you get from a single test item?
If an examinee answers a single test item (aka question) about a specific topic correctly, what can we infer about the examinee’s knowledge of that topic? Even with the highest-quality performance assessment items, the answer is: very little. That’s because several factors interact when a test item is presented to an examinee, such as the readability of the item, the examinee’s motivation, and prior exposure to the item, whether through cheating (see Caveon for more information about cheaters) or because the pool of performance assessment items is very small.
Therefore, it is useful to think about knowledge (or competency, skill, or ability) using the framework of probability. In the absence of any other information, we can infer that the probability of an examinee possessing the knowledge of a specific topic is high if the examinee answers a test item about it correctly. The parallel in corporate training is that we can infer that the probability of an examinee possessing the skills to perform a specific task is high if the examinee performs the task correctly in an assessment situation.
Now, many things can affect this probability. The fidelity of the item is certainly important. Multiple-choice questions are often bashed by test haters because their fidelity is low, which makes it quite challenging to infer from a single response that an examinee knows something about a topic. Performance assessment items are often praised because they have a much higher level of fidelity, which increases that probability.
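One way to make the effect of fidelity concrete is Bayes' rule. The sketch below is illustrative only. The prior of 0.5, the 0.90 chance that a knowledgeable examinee answers correctly, and the chances that an examinee without the knowledge still answers correctly (0.25 for a four-option multiple-choice item, 0.05 for a performance item) are all assumed numbers, not values from any real test.

```python
def posterior_mastery(prior, p_correct_if_master, p_correct_if_nonmaster):
    """Probability of mastery after ONE correct response, via Bayes' rule."""
    numerator = prior * p_correct_if_master
    evidence = numerator + (1.0 - prior) * p_correct_if_nonmaster
    return numerator / evidence


# All inputs below are assumed, illustrative values.
PRIOR = 0.5                 # belief in mastery before seeing any response
P_MASTER_CORRECT = 0.90     # a master can still slip on a single item

# Low-fidelity item: a non-master can guess one of four options.
multiple_choice = posterior_mastery(PRIOR, P_MASTER_CORRECT, 0.25)

# High-fidelity item: a non-master rarely succeeds by luck.
performance_item = posterior_mastery(PRIOR, P_MASTER_CORRECT, 0.05)

print(round(multiple_choice, 3))   # 0.783
print(round(performance_item, 3))  # 0.947
```

Under these assumptions the higher-fidelity item moves the probability further, but neither single response gets it to 100%.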
No matter what type of item is used, the probability will never reach 100%. It is this fact that makes it fundamentally necessary to approach testing from a probabilistic perspective, as Item Response Theory (IRT) and Rasch Measurement Theory do. It also makes it imperative to collect more information, that is, to ask more test questions. Sometimes this means that test questions begin to look quite similar.
I remember my 3rd-grade math book, where the exercises required the same computations over and over again using different numbers. Sometimes I answered 90% of the items on a topic correctly, having missed a few due to careless errors. In this situation, my teacher was able to infer that I understood the topic. Other times I did quite poorly, answering very few items correctly, and the teacher easily inferred my lack of knowledge about the topic. On some occasions I answered a good number correctly but still missed some, ending up at around 75% correct. Sometimes my teacher could inspect my responses and identify a trend, such as a specific area where I lacked the knowledge. Other times she could not. In those cases, I typically possessed the knowledge but applied it inconsistently due to tiredness, boredom, or carelessness.
In all three of these situations, it should be clear that the information gleaned from one test item is not enough to infer mastery of the topic. This leads us to the next logical question: how many questions should a test maker ask about a specific topic? It doesn’t really matter! What does matter is that you ask enough questions about the entire content domain to make valid inferences about the examinee’s knowledge of the entire subject rather than of a specific topic. If you’re really interested in an examinee’s topic knowledge, then you’ll need to build a test for every single topic, which might look a lot like the 3rd-grade math test discussed earlier, where the test items were very similar. Although this type of test might seem reiteratively, recursively, repetitively redundant (uh huh!), it will not only achieve better measurement but also help the examinees develop procedural memory. After all, practice makes perfect.
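To see why asking more questions helps, here is a minimal sketch using the Rasch model, in which the probability of a correct response depends on the difference between examinee ability (theta) and item difficulty (b), both in logits. The standard error of the ability estimate is taken as one over the square root of the test information, which is the sum of p(1 - p) across items. The function names are hypothetical, and the items are assumed to be perfectly targeted at the examinee (b equal to theta), which is the best case.

```python
import math


def rasch_probability(theta, b):
    """Rasch model: chance of a correct response for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def standard_error(theta, difficulties):
    """Standard error of the ability estimate: 1 / sqrt(test information)."""
    information = 0.0
    for b in difficulties:
        p = rasch_probability(theta, b)
        information += p * (1.0 - p)   # information contributed by one item
    return 1.0 / math.sqrt(information)


# Assumed scenario: every item has difficulty 0.0 and the examinee's ability is 0.0.
for n_items in (1, 4, 16, 64):
    se = standard_error(0.0, [0.0] * n_items)
    print(f"{n_items:>3} items: SE = {se:.2f} logits")
```

In this best case a single item leaves a standard error of 2 logits, and the error is cut in half only when the number of items is quadrupled.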
B…


May 28th, 2009 at 12:42 pm
Well said, Brian!
All I would add are two notes, one on reliability and one on the context of the application. As to reliability, hold it a minute! It doesn’t really matter how many questions you ask, but what does matter is that you ask enough questions to make valid inferences? I think this is called a non sequitur, right? The relationship between the number of questions asked (and associated rating categories, if used) and the resulting reproducible statistical distinctions (given the standard deviation) is well known. It’s also very useful to have in hand when designing instruments, so that you’ll be able to support the validity of the inferences you want to make. For more on this, see Linacre’s classic Rasch generalizability theory nomograph (http://www.rasch.org/rmt/rmt71h.htm).
Concerning context, screening, which certainly does not provide information relative to mastery, might effectively be performed with well-chosen single items, whereas diagnostic (integrated assessment and instruction), accountability, or research applications could not. For more on the measurement properties of single-item screening items, see:
DeSalvo, K., Fisher, W. P., Jr., Tran, K., Bloser, N., Merrill, W., & Peabody, J. W. (2006, March). Assessing measurement properties of two single-item general health measures. Quality of Life Research, 15(2), 191-201.
May 28th, 2009 at 1:41 pm
William,
Good to hear from you. I added some bolding to show the distinction between number of items per topic and number of items per test. Maybe that will clarify my point here.
Thank you for directing folks to the concept of reliability, which is fundamentally important in any test development. I’m planning to write a future blog post about test length, so we can address these topics then.