Item Exposure Control Mechanisms
A few weeks ago, I had the privilege of making the annual trek to the AERA/NCME meeting. Typically, I’m quite cynical about this event, but this year I was pleasantly surprised. The program was well done, the presentations were pretty darn good, and the conversations I had were stimulating. All in all, I came back pretty inspired. My thanks to Kathy Gialluca and Daniel Bolt, the NCME program chairs, and to Barbara Dodd and Chad Buckendahl, who served as moderators and discussants in some of my sessions.
Today’s topic is an advanced one, a follow-up to the NCME session on Item Exposure Control Mechanisms in CAT (Computerized Adaptive Testing) where I served as discussant. This topic is a real yawner for me because it’s typically business and operational motivations, rather than academic or psychometric rigor, that dictate the choice of item exposure control. Before diving in, here’s a very brief history of item exposure control mechanisms.
In the old days (circa 1980s), the CAT community referred to this as Item Selection, not Item Exposure Control. At the time, the algorithms (e.g., Kingsbury, G. G., & Zara, A. R., 1989) all contained an element of randomization, and they all worked quite well. However, a number of testing programs using the Two-Parameter Logistic (2PL) or Three-Parameter Logistic (3PL) models (for more information on Item Response Theory, or IRT, see future posts) had less-than-stellar item pools. In these programs, some items were always selected because they were stellar performers, while others were never selected because they were low quality. Somehow, these programs got the idea that they could improve their exams with some complex modeling that forced the neglected items into use. Thus was born the Sympson-Hetter algorithm (Sympson, J. B., & Hetter, R. D., 1985). This method succeeds in capping the rate at which any single item can be administered, but it is extremely cumbersome to implement and sucks much of the efficiency out of a CAT. About a decade later, some Spanish researchers proposed the Progressive Method (Revuelta, J., & Ponsoda, V., 1998). This method essentially selects items at random at the beginning of the test and slowly but surely turns into a maximum-information CAT by the end of the test.
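To make the mechanics concrete, here is a minimal sketch of the two selection rules in Python. The item representation, the helper names, and the exposure parameters K are my own inventions for illustration; the papers cited above are the authoritative sources.

```python
import math
import random

def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta
    (set a=1, c=0 for the 1PL/Rasch case)."""
    p = c + (1 - c) / (1 + math.exp(-a * (theta - b)))
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

def sympson_hetter_select(pool, theta, K):
    """Sympson-Hetter (sketch): try items in order of information and
    administer item i with probability K[i], its exposure control
    parameter. Finding the K values requires iterative large-scale
    simulation, which is the cumbersome part."""
    ranked = sorted(pool, key=lambda it: -info_3pl(theta, *it["abc"]))
    for item in ranked:
        if random.random() < K[item["id"]]:
            return item
    return ranked[-1]  # fallback if every item is rejected

def progressive_select(pool, theta, n_given, test_len):
    """Progressive Method (sketch): blend a random component with
    information, shifting weight toward information as the test goes on."""
    s = n_given / test_len  # 0 at the start of the test, ~1 at the end
    h = max(info_3pl(theta, *it["abc"]) for it in pool)
    return max(pool, key=lambda it: (1 - s) * random.uniform(0, h)
                                    + s * info_3pl(theta, *it["abc"]))
```

With s = 0 the progressive rule is a pure lottery; with s = 1 it is ordinary maximum-information selection, which is why the method behaves like a randomization technique early in the test and a conventional CAT late.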
The item exposure session at NCME was an academic comparison of these mechanisms using simulated data and small item pools. I had a good time reading the papers. I like being a discussant: it forces you to read papers you wouldn’t otherwise read, and it forces you to say something constructive about them (hah!). No offense to the authors. I have three points to make.
Point #1…During my time as a discussant, I asked the NCME crowd which item exposure control mechanism they were using, and I discovered that Sympson-Hetter is slightly more popular than the randomization techniques, and that only a few folks are using the Progressive Method. I was thoroughly shocked by this; I expected most folks to be using randomization techniques. This tells me one thing…mathematically inclined psychometricians have a habit of wasting their clients’ time and dollars because they enjoy mathematically modeling things that today’s computers can track empirically just as quickly.
Point #2…Item exposure control is only necessary for programs that can’t build a good item pool. In the four papers I discussed, the authors had REALLY BAD item pools. In one, only 51 items were ever selected from a 176-item pool for a 40-item test. Hmmmm…why build a CAT in this situation? Two fixed forms would perform better than a CAT.
The moral of this point is that ALL testing professionals need to be concerned with building better item pools. That fixes item exposure issues better than any control mechanism available.
Point #3…One-Parameter Logistic (1PL, or Rasch) CATs yield flatter item exposure. With no discrimination parameter in the model, no single item dominates on information, so selection naturally spreads across the pool. That’s one more reason to implement the Rasch model for testing programs moving to CAT.
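A quick way to see this for yourself is to run a bare-bones maximum-information CAT against a Rasch pool and a 3PL pool and compare how the administrations spread out. A rough sketch, reusing the hypothetical info_3pl helper from the sketch above; the pool size of 176 mirrors the paper discussed under Point #2, and the step size and parameter distributions are invented for illustration:

```python
import collections
import math
import random

def simulate_exposure(pool, n_examinees=1000, test_len=40):
    """Crude maximum-information CAT: tally how often each item
    is administered across many simulated examinees."""
    counts = collections.Counter()
    for _ in range(n_examinees):
        theta_true = random.gauss(0, 1)
        theta_hat, given = 0.0, set()
        for _ in range(test_len):
            remaining = [it for it in pool if it["id"] not in given]
            item = max(remaining, key=lambda it: info_3pl(theta_hat, *it["abc"]))
            given.add(item["id"])
            counts[item["id"]] += 1
            a, b, c = item["abc"]
            p_true = c + (1 - c) / (1 + math.exp(-a * (theta_true - b)))
            u = 1 if random.random() < p_true else 0
            p_hat = c + (1 - c) / (1 + math.exp(-a * (theta_hat - b)))
            theta_hat += 0.3 * (u - p_hat)  # crude fixed-step ability update
    return counts

# Rasch pool: only difficulty varies, so no item dominates on information.
rasch_pool = [{"id": i, "abc": (1.0, random.gauss(0, 1), 0.0)} for i in range(176)]
# 3PL pool: discriminations vary, so a few high-a items hog the selections.
pl3_pool = [{"id": i, "abc": (random.lognormvariate(0, 0.5), random.gauss(0, 1), 0.2)}
            for i in range(176)]

for name, pool in (("Rasch", rasch_pool), ("3PL", pl3_pool)):
    counts = simulate_exposure(pool)
    top40 = sum(c for _, c in counts.most_common(40)) / sum(counts.values())
    print(f"{name}: {len(counts)}/176 items used; "
          f"top 40 items carry {top40:.0%} of administrations")
```

You should see the 3PL pool leaning much harder on its 40 most informative items than the Rasch pool does, even with no exposure control in play.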
B…
May 7th, 2009 at 8:46 am
Three cheers for your third point. Further, if you have sufficiently large numbers of Rasch-calibrated items in your bank (e.g. multiples within a few tenths of logits of each other along the full range), you’ll have even less trouble with item exposure control.
May 8th, 2009 at 8:51 am
Welcome to the blogging world, my friend! This is very exciting. Of course I have no idea what you are talking about…
April 23rd, 2010 at 12:01 pm
I’m just now getting into the area of item exposure control. I think it is a great area of research, especially considering issues of item security within testing companies and organizations hiring employees in emerging markets. As an industrial and organizational psychologist, I helped IBM solve some test security concerns a few years back using artificial intelligence to generate parallel tests for paper-and-pencil administrations in Asia-Pacific. Now most of my work is focused on computer adaptive methods.
On Point 3, I concur to a certain degree about items calibrated according to the Rasch model. However, the Rasch model does not fit most data, even though it has some very desirable characteristics in terms of (1) utilization rate of the item pool, (2) balanced item exposure rates across items, and (3) accuracy in estimating theta. Those desirable characteristics deteriorate quickly once you move away from the Rasch model and begin to apply Sympson-Hetter or even the Progressive Method, i.e., you lose accuracy in estimating theta given the same number of items administered. Is that efficient? Probably not!
On Point 2, you are right on with 51 items being used in a pool of 176. A utilization rate of 29% [(51/176)*100] is inefficiency at its finest. Just imagine if a psychometrician is paid $100 per operational item (a very conservative estimate); that’s $12,500 down the drain on items that won’t even see the operational light of day under their algorithm. And that is not to mention the time spent writing the items and preparing them for operation; that’s too much waste.
I’ve been running a few simulations at my company with a few item exposure control methods and have concluded that all of the aforementioned methods and fixes (simple and complex) are inefficient on one criterion or another (e.g., item pool utilization rate, maximum item exposure rates, or accuracy in estimating theta with the 2PL, 3PL, or even the multidimensional extensions of the 2PL and 3PL models). I’ve reasoned that as long as you are using maximum information as the selection criterion, putting a bandage on the algorithm will not ultimately solve the problem. One way or another, you will sacrifice accuracy in measurement for balanced item exposure and high item pool utilization rates.
I’ve derived a game changer and would be willing to discuss it if you are interested.
I’ve enjoyed your post. It was refreshing even though it is about a year old, lol!
Regards, DB
June 22nd, 2010 at 4:45 pm
Adaptive Assessments recently authorized for public distribution a simulation study on item selection methods and item exposure control for the 3PL model, which replicated some of the results you discussed above for maximum information (e.g., item pool utilization rates less than 35% and item exposure rates greater than 85% for maximum-information item selection). I’ve provided a link to the paper. I’m presenting these results at a conference or two in the near future along with multidimensional adaptive testing and item exposure control. Perhaps we could do a session or two. Let me know! DB
http://www.drdamonbryant.com/AAS_Proof_of_Concept.pdf