U.S. Supreme Court Rules in Favor of Valid Tests

Brian D. Bontempo, PhD.

There have not been many court cases related to the testing industry that are commonly cited as landmark cases. It is for this reason that the recent U.S. Supreme Court decision in Ricci v. DeStefano may be cited for years to come. In this case, the Supreme Court ruled in favor of a group of New Haven, CT firefighters who met the qualifications for promotion to Captain or Lieutenant based on test results but were denied promotion. After analyzing the results, the test sponsor (the city of New Haven) discovered that the white candidates had performed better than the minority candidates and decided to use an alternative method to determine who would be promoted. It was revealed during the case that the exam was constructed using the results of a job task analysis, that the items were written and validated by qualified subject matter experts, and that the scoring model was developed in a scientifically sound manner. From the information provided, the exam appears to have been psychometrically sound. It is therefore monumental that the Supreme Court supported the use of the test results despite the poor performance of the minority candidates.

One thing I find quite interesting about this case is that no one considered or conducted an empirical study to determine whether the test was, in fact, biased. If you're not an advanced test maker, you may be asking why that would be necessary. The minority candidates clearly didn't do as well as the white candidates; doesn't that demonstrate that the test is biased? The answer is an emphatic no, and the following example illustrates why. Imagine a classroom teacher who separated the class into two groups based on his or her perception of the students' ability: one group comprised of the high-ability students and the other of the low-ability students. The teacher then administered a test to the two groups, and the results came back as expected: the group the teacher perceived to be high ability outperformed the low group. Would we conclude that the test is biased? No.

So what can we do to determine if low performance is due to a biased test?

Modern psychometric techniques allow us to use item-level test data to determine whether any test items (questions) are biased. These techniques, known as Differential Item Functioning (DIF) analyses, allow test makers and test users to objectively identify items that are more difficult for one subpopulation than another. These analyses are more in-depth than simply comparing the percentage of majority candidates that answered a question correctly to the percentage of minority candidates that did so. Although there are several different approaches to DIF analysis, the two most common are as follows:

  • The Classical Test Theory approach. This technique separates the candidates from each subpopulation into groups based on their ability and compares the performance of the majority and minority groups at each ability level (see the table below and the sketch after this list). In this situation, the percentage of majority candidates answering the question correctly was 76%, compared to 70% for the minority candidates. However, when you look at the question by ability level, the two groups perform quite similarly.
    [Figure: DIF-item1 — percent correct by ability level for the majority and minority groups]

    Contrast that with a different test item where the overall percentage of candidates answering correctly was again 76% for the majority and 70% for the minority, but where the gap between the majority and minority percent correct is about 5% at every ability level. Quite obviously, there is something distinctly biased about this test item.

    [Figure: DIF-Item2 — percent correct by ability level for the majority and minority groups]
  • The IRT approach. Although somewhat more difficult to understand, this approach is actually easier to calculate. To conduct a DIF analysis with IRT, simply calibrate the test twice, once with each subpopulation, and compare the item difficulty estimates. I recommend plotting the item difficulty estimates derived from one subpopulation against those derived from the other. The result is typically a fairly straight diagonal line; biased items appear off that line.
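To make the classical approach concrete, here is a minimal Python sketch (using pandas) of the kind of tabulation shown in the tables above. The data, column names, and ability cut points are invented purely for illustration; a real DIF analysis would also apply a formal statistical test such as Mantel-Haenszel.

```python
import pandas as pd

# Hypothetical data: one row per examinee, with a 0/1 score on the item of
# interest, a total test score, and a group label ("majority" / "minority").
df = pd.DataFrame({
    "group":  ["majority"] * 6 + ["minority"] * 6,
    "total":  [12, 18, 25, 31, 38, 44, 11, 19, 24, 32, 37, 45],
    "item_1": [0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1],
})

# Stratify examinees into ability levels based on total score, then compare
# the proportion answering the item correctly within each stratum.
df["ability_level"] = pd.cut(df["total"], bins=[0, 15, 30, 50],
                             labels=["low", "middle", "high"])

dif_table = (df.groupby(["ability_level", "group"], observed=True)["item_1"]
               .mean()
               .unstack("group"))
dif_table["difference"] = dif_table["majority"] - dif_table["minority"]
print(dif_table)

# Large, consistent differences within ability levels (not just overall)
# are the signal that the item may be functioning differently for the groups.
```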

Both of these techniques work quite well at identifying items that are biased. Once identified, test makers should review, revise, and/or remove items that are showing signs of bias. If there are a large number of these items, then this is a sign that the test itself is biased and should be overhauled completely.

In conclusion, if you are ever faced with a potential test bias issue, run a DIF analysis before taking your next step. In the case of the firefighters, this information might have helped the city of New Haven defend its assessment program against charges of test bias.

B…

Comments (1)

The Point-Biserial Correlation Coefficient

Brian D. Bontempo, PhD.

One common metric used to assess item quality is the point biserial correlation coefficient (rpb). The "pt bis," as it is sometimes called, is the correlation between an item score (1/0) and the total score on the test. Positive values are desirable and indicate that the item is good at differentiating between high-ability and low-ability examinees.

As straightforward as the concept may seem, there are a handful of different flavors of the point biserial. The most basic flavor relates the examinees' item scores to their total RAW scores on the test. A slightly more robust version, called the corrected point biserial, calculates the relationship between the item score and the total score after removing the item in question's contribution from that total. This flavor is very important for short tests, where one item can dramatically impact the total score. Another flavor is the point-to-measure correlation, which correlates the item score with the IRT ability estimate rather than the raw score. The point-to-measure correlation can come in both corrected and uncorrected forms.
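As a rough illustration (not the exact routine from any particular package), here is a Python sketch of the raw and corrected flavors computed from a matrix of scored responses; the response data are invented:

```python
import numpy as np

def point_biserials(responses):
    """Raw and corrected point-biserial for each item.

    `responses` is an examinees-by-items matrix of 0/1 item scores.
    The point biserial is simply the Pearson correlation between an item
    score column and a total score, so np.corrcoef does the work.
    """
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    raw, corrected = [], []
    for j in range(responses.shape[1]):
        item = responses[:, j]
        raw.append(np.corrcoef(item, total)[0, 1])
        # Corrected flavor: remove this item's contribution to the total
        # so the item is not being correlated with itself.
        corrected.append(np.corrcoef(item, total - item)[0, 1])
    return np.array(raw), np.array(corrected)

# Example: 6 examinees, 4 items
resp = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 0, 0]])
raw, corrected = point_biserials(resp)
print(raw.round(2), corrected.round(2))
```

On a short test like this four-item example, the corrected values are noticeably lower than the raw ones, which is exactly why the correction matters when one item makes up a large share of the total score.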

Since test makers rely heavily on the pt bis as a measure of item quality, it is important to remember that the point biserial is greatly affected by restriction of range in the total test score (or measure). That is, when the total scores do not vary much, the point biserials come out smaller than the true relationship. This is most noticeable for difficult items, where the total scores of the people who answered the item correctly tend to be quite similar (in fact, they are often close to perfect scores).

Given that the size of a point biserial is greatly impacted by the variance in the scores of the examinees, it is not advisable to compare the point biserial of items across exams. A point biserial of .1 may be great for one test and really yucky for another. In addition, a point biserial of .1 for the hard items of a test may be great while it may be really yucky for the easy items of the same test.

We recommend that test makers use the point biserial as a way of identifying items that should be reviewed by subject matter experts for potential problems. We have found that items with low point biserials tend to have at least one of the following problems:

  • A debatable correct answer
  • More than one correct answer
  • No real correct answer
  • An unclear stem (the question text)

By calculating and using the point biserial, test makers can identify problematic items. From there they can collaborate with subject matter experts to improve the quality of their tests one item at a time. Look for future posts on other item quality metrics such as item fit statistics.

B…

Comments (12)

What do you get from a single test item?

Brian D. Bontempo, PhD.

If an examinee answers a single test item (aka question) about a specific topic correctly, what can we infer about the examinee's knowledge of that topic? Even with the highest quality performance assessment items, the answer is very little. That's because several factors interact when a test item is presented to an examinee, such as the readability of the item, examinee motivation, or prior knowledge of the item, gained either by cheating (see Caveon for more information about cheaters) or because the set of performance assessment items is very small.

Therefore, it is useful to think about knowledge (or competency, skill, or ability) using the framework of probability. In the absence of any other information, we can infer that the probability of an examinee possessing the knowledge of a specific topic is high if the examinee answers a test item about it correctly. The corollary in corporate training would be: we can infer that the probability of an examinee possessing the skills to perform a specific task is high if the examinee performs the task correctly in an assessment situation.

Now, there are many things that can affect this probability. The fidelity of the item is certainly important. Multiple-choice questions are often bashed by test haters because their fidelity is low making it quite challenging to infer that an examinee knows something about a topic from a single response. Performance assessment items are often praised because they have a much higher level of fidelity thereby increasing the probability.

No matter what type of item is used, the probability will never reach 100%. It is this fact that makes it fundamentally necessary to approach testing from a probabilistic perspective as Item Response Theory (IRT) and Rasch Measurement Theory do. This fact also makes it imperative to collect more information or ask more test questions. Sometimes this means that test questions begin to look quite similar. I remember my 3rd grade math book where the exercises required the same computations over and over again using different numbers. Sometimes, I answered 90% of the items correctly about a topic, having missed a few items due to careless errors. My teacher was able to infer that I understood the topic in this situation. Other times, I did quite poorly, answering very few items correctly. This time the teacher easily inferred my lack of knowledge about the topic. There were some occasions where I answered a good number correctly but still missed some, answering around 75% of the items correctly. Sometimes, my teacher could inspect my responses and identify a trend, such as a specific area where I lacked the knowledge. Other times, she could not. In these situations, I typically possessed the knowledge but applied my knowledge inconsistently due to tiredness, boredom, or carelessness.
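A small sketch of that probabilistic view, using the Rasch model's item response function; the ability and difficulty values (and the ten-item scenario) are invented purely for illustration:

```python
import math

def rasch_probability(theta, b):
    """Rasch model: probability of a correct response given examinee
    ability (theta) and item difficulty (b), both on the logit scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# One item tells us very little: even a clearly able examinee
# (theta one logit above the item's difficulty) misses ~27% of the time.
print(round(rasch_probability(theta=1.0, b=0.0), 2))   # ~0.73

# Ten similar items tell us much more: the chance that the same examinee
# answers half or fewer of them correctly is only about 10%, so a low score
# on the set is real evidence about knowledge, where a single miss is not.
p = rasch_probability(1.0, 0.0)
prob_half_or_fewer = sum(
    math.comb(10, k) * p**k * (1 - p)**(10 - k) for k in range(0, 6)
)
print(round(prob_half_or_fewer, 3))   # ~0.10
```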

In all three of these situations, it should be clear that the information gleaned from one test item is not enough to infer mastery of the topic. This leads us to the next logical question, how many questions should a test maker ask about a specific topic? It doesn’t really matter! What does matter is that you ask enough questions about the entire content domain to make valid inferences about the examinee’s knowledge of the entire subject rather than a specific topic. If you’re really interested in an examinee’s topic knowledge, then you’ll need to build a test for every single topic, which might look a lot like the 3rd grade math test discussed earlier where the test items were very similar. Although this type of test might seem reiteratively, recursively, repetitively, redundant (uh huh!), it will not only succeed in achieving better measurement, it will help the examinees to develop procedural memory. After all, practice makes perfect.

B…

Comments (2)

A Criterion-Referenced Test is NOT a Mastery Test

Brian D. Bontempo, PhD.

This week's entry is inspired by a figure in a widely publicized book by Sharon A. Shrock and William Coscarelli, Criterion-Referenced Test Development: Technical and Legal Guidelines for Corporate Training. I have not finished reading the book, but I find the authors' perspective an interesting departure from the traditional testing paradigms (which equals 20 cents for those of you keeping score). I would like to thank them for stretching my mind and inspiring me to think differently about testing in corporate settings.

I would like to take this opportunity to continue the stretching by expanding on a topic they discuss in chapter two, which defines both criterion-referenced testing (CRT) and norm-referenced testing (NRT). A norm-referenced test (NRT) is one in which test outcomes (e.g., grades or pass/fail) are determined by each examinee's score relative to the other examinees. Although this practice is uncommon (and arguably unethical), it is occasionally still used today. For example, some state Bar exams are norm-referenced. Typically, the top X percent of examinees are awarded a passing mark, regardless of how competent or incompetent the group that took the exam together was. In other words, if a prospective lawyer were to take the exam alongside the most competent group of graduates, (s)he would have less chance of achieving a passing mark than (s)he would alongside a group of bottom feeders. Does it surprise you that the legal profession would endorse something outdated, scientifically unsupported, and arguably unethical?

On the other hand, a criterion-referenced test (CRT) is a test composed of specific objectives, or competency statements. This type of test is common in licensure and certification. The passing rates for CRTs vary with each test cycle since examinees are evaluated based on their competency relative to a criterion-referenced passing standard (aka cutscore). There are many other attributes of these two types of tests beyond their scoring methodology, and I’ll leave it up to future posts to expand upon these.
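To make the distinction concrete, here is a minimal Python sketch contrasting the two scoring rules; the scores, passing percentage, and cutscore are all invented for illustration:

```python
import numpy as np

scores = np.array([58, 62, 65, 70, 71, 74, 78, 81, 85, 90])  # hypothetical exam scores

# Norm-referenced: the top 30% pass, regardless of how competent the cohort is.
nrt_cut = np.percentile(scores, 70)
nrt_pass = scores >= nrt_cut

# Criterion-referenced: anyone at or above a fixed, criterion-based cutscore
# passes, so the pass rate varies from cohort to cohort.
crt_cutscore = 75
crt_pass = scores >= crt_cutscore

print("NRT passes:", nrt_pass.sum(), "of", len(scores))  # always about 3 of 10
print("CRT passes:", crt_pass.sum(), "of", len(scores))  # depends on the cohort
```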

One other type of test that Shrock and Coscarelli refer to is a mastery test, a test on which most examinees answer the vast majority of the content correctly. K-12 classroom tests are commonly designed this way. The distribution of scores for a mastery test looks similar to the mastery test curve shown below. I think it is important to point out that mastery tests are a form of criterion-referenced test. In other words, every mastery test is a criterion-referenced test, but not every criterion-referenced test is a mastery test. See below for a visual representation of this.

[Figure: Types of Tests — mastery tests shown as a subset of criterion-referenced tests]

So, what do we call a non-mastery CRT? To be honest, I don’t know. I have heard people refer to them as non-mastery tests or non-mastery, criterion-referenced tests.

Mastery tests are useful in the corporate training world, where the content domains are small (typically measured in class hours) and the shelf life of the training programs and tests is generally short (measured in months or years). However, they are NOT optimal for certification (corporate or non-corporate).

[Figure: Mastery Test Curve — score distribution for a typical mastery test]

Why should a corporation build a non-mastery, criterion-referenced test? There are two primary reasons.

  1. If constructed properly, non-mastery, criterion-referenced tests provide more information than a simple pass/fail result. Non-mastery, criterion-referenced tests are competency measurement instruments. Just as a ruler measures the length of an object, a non-mastery, criterion-referenced test can measure the competency of an individual. This ruler can be used to measure the competency of individuals or the difficulty of the test questions, which can provide valuable feedback to the training program or corporation.
  2. When the level of mastery changes, it is much easier to change the level of competency required to achieve mastery than it is to write new content or a whole new exam.

Comments (4)

Item Exposure Control Mechanisms

Brian D. Bontempo, PhD.

A few weeks ago, I had the privilege of making the annual trek to the AERA/NCME annual meeting. Typically, I’m quite cynical about this event, but this year I was pleasantly surprised. The program was well done, the presentations were pretty darn good, and the people I chatted with stimulated me. All in all, I came back pretty inspired. I’d like to give thanks to Kathy Gialluca and Daniel Bolt, the NCME program chairs and to Barbara Dodd and Chad Buckendahl who served as moderators and discussants in some of my sessions.

Today's topic is an advanced one and is a follow-up to the NCME session I discussed on Item Exposure Control Mechanisms in CAT (Computerized Adaptive Testing). This is a topic that is a real yawner for me because it's typically business and operational motivations, rather than academic or psychometric rigor, that dictate the choice of item exposure control mechanism. Before diving in, here's a very brief history of item exposure control mechanisms.

In the old days (circa 1980s), the CAT community referred to this as Item Selection, not Item Exposure Control. At the time, the algorithms (e.g., Kingsbury & Zara, 1989) all contained an element of randomization, and they all worked quite well. However, there were a number of testing programs using the Two-Parameter Logistic (2PL) or Three-Parameter Logistic (3PL) models (for more information on Item Response Theory, or IRT, see future posts) that had 'less than stellar' item pools. For these programs, some items were always selected because they were stellar performers while others were never selected because they were low quality. Somehow, these programs got the idea that they could improve their exams by doing some complex modeling and finding a way to select these items. Thus was born the Sympson-Hetter item selection algorithm (Sympson & Hetter, 1985). This method succeeds in establishing a maximum exposure rate for each item, but it is extremely cumbersome to implement and sucks much of the efficiency out of a CAT. About a decade later, some Spanish researchers proposed the Progressive Method (Revuelta & Ponsoda, 1998). This method essentially selects items at random at the beginning of the test and slowly but surely turns into a maximum-information CAT by the end of the test.
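As a rough sketch of the idea behind the Progressive Method, the selection value for each candidate item mixes a random component and an information component, with the weight shifting toward information as the test progresses. The Python below is a simplification for illustration (the weighting scheme and names are mine), not the published algorithm verbatim:

```python
import random

def progressive_select(available_items, item_information, items_given, test_length):
    """Pick the next item: early in the test the choice is mostly random,
    late in the test it is mostly driven by item information."""
    # Weight on the information component grows from 0 to 1 over the test.
    w = items_given / float(test_length)
    best_item, best_value = None, float("-inf")
    for item in available_items:
        value = (1 - w) * random.random() + w * item_information[item]
        if value > best_value:
            best_item, best_value = item, value
    return best_item

# Hypothetical pool: item id -> information at the current ability estimate
info = {"A": 0.9, "B": 0.5, "C": 0.2, "D": 0.7}
print(progressive_select(list(info), info, items_given=0, test_length=10))  # nearly random
print(progressive_select(list(info), info, items_given=9, test_length=10))  # nearly max-information
```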

The item exposure session at NCME was an academic comparison of these using simulated data for small item pools. I had a good time reading the papers. I like being a discussant. It forces you to read papers you wouldn’t otherwise read. Moreover, it forces you to say something constructive about them (hah!). No offense to the authors. I have three points to make.

During my time as a discussant, I asked the NCME crowd which item exposure control mechanism they were using, and I discovered that Sympson-Hetter is slightly more popular than the randomization techniques and that only a few folks are using the progressive method. I was thoroughly shocked by this; I expected most folks to be using randomization techniques. This tells me one thing…mathematically inclined psychometricians have a habit of wasting their clients' time and dollars because they enjoy mathematically modeling things that today's computers can collect empirically just as quickly.

Point #2…Item exposure control is only necessary for programs that can't build a good item pool. In the four papers I discussed, the authors had REALLY BAD item pools. In one, only 51 items were selected from a 176-item pool for a 40-item test. Hmmmm…why build a CAT in this situation? Two fixed forms would perform better than a CAT.

The moral of this point is that ALL testing professionals need to be concerned with building better item pools. That does more to fix item exposure issues than any control mechanism available.

Point #3…I note that One-Parameter Logistic (1PL, or Rasch) CATs yield flatter item exposure. That's one more reason to implement the Rasch model for testing programs moving to CAT.

B…

Comments (4)

Welcome to Mountain Measurement’s Blog

Hi!

This site is meant to be a casual, fun, and friendly braindump of the Mountain Measurement thought farm, our term for a sustainable think tank.

Comments (1)

© 2009-2017 Mountain Measurement, Inc. All Rights Reserved