Brian D. Bontempo, PhD.

Few court cases related to the testing industry are commonly cited as landmark cases. For that reason, the recent decision by the U.S. Supreme Court in Ricci et al. v. DeStefano et al. may be cited for years to come. In this case, the Supreme Court ruled in favor of several New Haven, CT firefighters who met the qualifications for promotion to Captain or Lieutenant based on test results but were denied the promotion. After analyzing the test results, the test sponsor (the city of New Haven, CT) discovered that the white candidates had performed better than the minority candidates and decided to use an alternative method to determine who would be awarded promotion. It was revealed during the case that the exam was constructed using the results of a job task analysis, that the items were written and validated by qualified subject matter experts, and that the scoring model was developed in a scientifically sound manner. From the information provided, it seems as though the exam was psychometrically sound. It is therefore monumental that the Supreme Court supported the use of the test results despite the poor performance of minority candidates.

One thing that I find quite interesting about this case is that no one considered or conducted an empirical study to determine whether the test was, in fact, biased. If you're not an advanced test maker, you may be asking why this would be necessary. It's clear that the minority candidates didn't do as well as the white candidates; doesn't that clearly demonstrate that the test is biased? The answer to that question is an emphatic no. I'll illustrate why with the following example. Let's imagine a classroom teacher who separated the class into two groups based on the teacher's perception of the students' ability. One group was made up of the high-ability students and the other of the low-ability students. The teacher then administered a test to the two groups. The results came back as expected: the group that the teacher perceived to be high ability outperformed the low-ability group. Would we conclude that the test is biased? No. The score difference simply reflects a real difference in ability between the two groups, not a flaw in the test.

So what can we do to determine if low performance is due to a biased test?

Modern psychometric techniques allow us to use the item level test data to determine if there are any test items (questions) that are biased. These techniques, called Differential Item Functioning (DIF), allow test makers and test users to objectively identify items that are more difficult for one subpopulation than another. These analyses are more in-depth than simply comparing the percentage of majority candidates that answered a question correctly to the percentage of minority candidates that did so. Although there are several different approaches to DIF analyses, the two most common are as follows:

  • The Classical Test Theory approach. This technique separates the candidates from each subpopulation into groups based on their ability and compares the performance of the majority and minority groups at each ability level. See the table below. In this situation, the percentage of majority candidates answering the question correctly was 76% compared to 70% for the minority candidates. However, when you look at the question by ability level, the two groups appear to be performing quite similarly. (A short code sketch of this stratified comparison appears after this list.)
    [Table: DIF-item1]

    Contrast that with a different test item where the overall percentage of candidates answering the item correctly was again 76% for the majority and 70% for the minority. This time, the difference between the majority and minority percent correct is about 5% at each ability level. Quite clearly, there is something distinctly biased about this test item, since the gap persists even when candidates of comparable ability are compared.

    [Table: DIF-Item2]
  • The IRT approach. Although somewhat more difficult to understand, this approach is actually easier to calculate. To conduct a DIF analysis with IRT, simply calibrate the test twice, once with each subpopulation, and compare the item difficulty estimates. I recommend comparing the results by plotting the item difficulty estimates derived from one subpopulation against those derived from the other. The result is typically a fairly straight diagonal line. Biased items appear as points that fall noticeably off that line. (A sketch of this plot appears after this list.)
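
Here is a minimal sketch, in Python, of the Classical Test Theory style check described in the first bullet: candidates are split into ability strata by total score, and the item's percent correct is compared between groups within each stratum. The function name, column names, and simulated data are all hypothetical; they are illustrative only and do not come from the New Haven exam or any real data set.

    # A minimal sketch of a CTT-style DIF check: stratify candidates by
    # total score, then compare percent correct on one item by group
    # within each stratum. Data and names here are hypothetical.
    import numpy as np
    import pandas as pd

    def ctt_dif_table(total_scores, item_correct, group, n_strata=5):
        """Percent correct on one item, by ability stratum and group."""
        df = pd.DataFrame({
            "total": total_scores,   # total test score (ability proxy)
            "item": item_correct,    # 0/1 responses to the item under review
            "group": group,          # e.g. 'majority' / 'minority'
        })
        # Split candidates into ability strata of roughly equal size
        df["stratum"] = pd.qcut(df["total"], q=n_strata, labels=False,
                                duplicates="drop")
        # Percent correct for each group within each stratum
        table = (df.groupby(["stratum", "group"])["item"]
                   .mean()
                   .unstack("group") * 100)
        return table.round(1)

    # Simulated example with no real DIF, so the two columns should stay
    # close within every stratum.
    rng = np.random.default_rng(0)
    n = 2000
    group = rng.choice(["majority", "minority"], size=n)
    ability = rng.normal(size=n)
    total = (40 + 10 * ability + rng.normal(scale=3, size=n)).round()
    p_item = 1 / (1 + np.exp(-(ability - 0.2)))  # same item behavior for both groups
    item = rng.binomial(1, p_item)

    print(ctt_dif_table(total, item, group))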
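And here is a minimal sketch of the IRT-style check from the second bullet: estimate item difficulties separately within each subpopulation and plot one set of estimates against the other. A real analysis would use a proper IRT calibration (for example, a Rasch or 2PL model from a dedicated package); to keep this example self-contained, a centered logit of each item's proportion correct stands in as a rough difficulty estimate. All names and data are hypothetical and simulated.

    # A minimal sketch of the IRT-style DIF plot: calibrate each group
    # separately, then plot group A difficulties against group B's.
    # Items far from the diagonal are DIF candidates.
    import numpy as np
    import matplotlib.pyplot as plt

    def rough_difficulty(responses):
        """Centered logit of each item's failure rate (rows = people, cols = items)."""
        p = responses.mean(axis=0).clip(0.01, 0.99)  # proportion correct per item
        d = np.log((1 - p) / p)                      # harder items get larger values
        return d - d.mean()                          # center so the two scales line up

    rng = np.random.default_rng(1)
    n_items = 30
    b = rng.normal(size=n_items)                     # true item difficulties

    def simulate(n_people, extra_difficulty):
        """Simulate 0/1 responses; item 7 can be made harder for this group only."""
        theta = rng.normal(size=(n_people, 1))
        bias = np.zeros(n_items)
        bias[7] = extra_difficulty
        p = 1 / (1 + np.exp(-(theta - (b + bias))))
        return rng.binomial(1, p)

    d_a = rough_difficulty(simulate(1500, 0.0))      # group A calibration
    d_b = rough_difficulty(simulate(1500, 1.0))      # group B calibration (item 7 biased)

    plt.scatter(d_a, d_b)
    lims = [min(d_a.min(), d_b.min()), max(d_a.max(), d_b.max())]
    plt.plot(lims, lims, linestyle="--")             # identity line: the no-DIF expectation
    plt.xlabel("Item difficulty, group A")
    plt.ylabel("Item difficulty, group B")
    plt.title("Items far from the diagonal are DIF candidates")
    plt.show()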

Both of these techniques work quite well at identifying items that are biased. Once identified, test makers should review, revise, and/or remove items that are showing signs of bias. If there are a large number of these items, then this is a sign that the test itself is biased and should be overhauled completely.

In conclusion, if you are ever faced with a potential test bias issue, run a DIF analysis before taking your next step. In the case of the firefighters, this information may have helped the city of New Haven to defend its assessment program against test bias criticism.

B…