ATP Conference Preview

Brian D. Bontempo, PhD.

Greetings!  It’s that wonderful time of the year, when testing professionals from around the world come together to commune, celebrate warm weather and engage in show and tell.  What am I talking about?  The annual Innovations in Testing Conference sponsored by the Association of Test Publishers.  Newbies to testing are encouraged to attend, as are seasoned veterans.  I have been going to the ATP conference since its second year, and I find it to be the highlight of the year.

I do recommend every session in the program, even the ones on Marketing (which blew me away a few years back)!  Nonetheless, I took a few moments to take a deeper look at the conference program.  Here are some of my session recommendations.

1.) Putting the Social Back in Networking (No link available) - Tuesday Feb 9, 2010 at 3:45 PM - This should be a fantastic opportunity to play face-to-face social networking by mimicking the online tools that many of us use today.  I believe in the power of community and I would like to see more activity on the part of my fellow testing community members to tweet, facebook, and link-in.  This session will provide the inspiration for those who are nervous.

2.) The keynote address - I have read a little online about Scott Berkun and this guy has some great things to say about innovation.  Since the word innovation is in the title of the conference, I have high hopes for this session.

3.) Evaluating and Charting a Path for Innovation - This is a session in which I am presenting so consider yourself invited.  This session will provide participants with the time to think about and discuss their own innovation projects.  For anyone contemplating or undertaking enterprise level innovation, this session should provide the inspiration and space to formulate or organize your thoughts and action plan.

4.) Assessing the Hard Stuff with Innovative Items - There are a few new innovative item bank and item type providers out there right now.  This is a presentation by one of those groups, Atvantus.  Although it has been about a year since I extensively toured the item banks that are available, this group seems to be producing more advanced item types than many of the other groups.  If you’re into innovative items, this would be a good opportunity to see what they have to say.

5.) Automating the Item Review Process - The session covers entirely new ground about a topic that I am thinking more and more about each day.  The ground is natural language processing and the topic is item enemies.  Kirk’s a bright guy.  If you can keep up with him, I’m sure you’ll get a lot from this session.

I am also pretty excited by how much technological innovation is finally coming to testing.  It’s not really a surprise that it has taken over twenty years for it to get here.  Twenty years ago, the bucks were in consumer and business technology.  With the slowdown in the economy, the testing industry has shone as a growing industry.  No wonder the technology providers are descending on our community.  Although I am very excited about this influx of talented people and the services they provide, I must caution all not to be overzealous about cool technology that may have little psychometric rigor with which to back it up.  A good test or item should have both face validity and psychometric soundness.  The ATP conference should provide a great forum for people interested in testing to learn more about these new technologies and the ways in which the technologies should be implemented to improve your assessment program.  Enjoy!

Why would a long test have a low reliability?

Brian D. Bontempo, PhD.

Introduction: Recently, we ran a reliability analysis for a client that is worth sharing. The certification exam had about 400 items and the reliability came in under .80 as measured by Cronbach’s Alpha (Cronbach, 1951), an undesirable if not unacceptable reliability for any test let alone a test containing 400 items. (NOTE: Mountain Measurement didn’t design or build this exam; we simply analyzed the data in an effort to begin helping the client).

Background: Reliability is a psychometric indicator of test quality. There is a direct relationship between test length (number of items) and the reliability. Many 50-60 item tests achieve a reliability that exceeds .80. Given that this exam contained hundreds of items, we set out to determine the cause of the low reliability.
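The link between test length and reliability can be made concrete with the Spearman-Brown prophecy formula. The post does not name this formula, so treat the sketch below as one common way to reason about it, under the assumption that any added items are comparable in quality and targeting to the existing ones. The function and variable names are ours.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Project the reliability of a test lengthened by `length_factor`.

    Assumes the added items behave like the existing items
    (same quality, same targeting), which is exactly the
    assumption that failed for the exam discussed here.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# A well-targeted 60-item test with reliability .80,
# lengthened to 400 comparable items.
projected = spearman_brown(0.80, 400 / 60)
print(round(projected, 3))
```

Under that assumption a 400-item exam should land well above .90, which is why a value under .80 was a red flag.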

Possible Cause #1- Low Item Quality: We calculated the item-test score total correlation and discovered that the vast majority of the items had positive point-biserial correlation coefficients revealing that the items were good.

Possible Cause #2- Misalignment between the difficulty of the test items (questions) and the candidates taking the exam: We calculated the distribution of test scores and in this, we found some problems. We discovered that the mean raw score was quite high (about 325) and that the standard deviation was quite small (around 15). The scores were so high that no one scored below 200.

When designing a test, the items should be targeted to the examinees. This maximizes the efficiency of the test. For our client, the 200 easiest items of this exam were adding absolutely nothing to the assessment. They were simply wasting the examinees’ time. Moreover, based on the standard deviation, there were only about 60 items that were providing measurement information. All others were really too easy or too hard to be providing much information at all. In essence, this was a 60-item test, which explains why the reliability was around .80.
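To see why badly targeted items add so little, here is a minimal sketch using item information under the Rasch model. The post does not say which model was used for this client, so the Rasch model and the example values (abilities and difficulties in logits) are assumptions for illustration only.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def rasch_information(ability: float, difficulty: float) -> float:
    """Item information p * (1 - p); it peaks when difficulty equals ability."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


# Hypothetical examinee at 0 logits.
on_target = rasch_information(0.0, 0.0)   # item matched to the examinee
too_easy = rasch_information(0.0, -3.0)   # item 3 logits too easy

print(round(on_target, 3), round(too_easy, 3))
```

In this made-up case the well-targeted item carries more than five times the information of the item that is 3 logits too easy.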

Recommendation: In order to raise the reliability of this exam, we advised the client to write more items that would be similar in difficulty to the 60 items that were best targeted to the examinees based on their empirical performance.

Automating Reports Using SQL and Excel

Dr. Michelle Amoruso

If you need to provide data summaries on a weekly or daily basis, you might want to consider using Excel to query results directly from your database. This will allow you to produce and update standardized reports in seconds, with the flexibility to make rapid report modifications (well, as fast as you can edit your SQL code). Sure, you can save your existing SQL queries and rerun them every time you need to produce a report. But why not save time and eliminate human error? Additionally, automating your report production makes it easy to provide multiple levels of data that can be used to identify problems, track resolution, and monitor project goals.

Eliminating Human Error
Instead of pasting in data from your SQL results viewer, why not connect directly to your database and remove human error entirely? Excel will connect to your SQL platform, query all the data from a designated view, and insert it into a worksheet. Once you have established this connection, you will just need to refresh the data to update your spreadsheet.

Monitoring the Data Collection Process
You can use SQL views to identify null values in your database, making it easy to track and resolve missing data. You can also design your views to exclude invalid data, to ensure that only valid data is included in the data summaries. Conversely, you can create a view that includes only invalid data, creating a separate worksheet to flag and track invalid cases.
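Here is a minimal sketch of the valid/invalid view idea, using Python's built-in sqlite3 module with an in-memory database so it can be run anywhere. The table and column names (survey_response, school_id, score) are made up for illustration; your own database platform and schema will differ, and Excel would connect to the views rather than to Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE survey_response (
        response_id INTEGER PRIMARY KEY,
        school_id   INTEGER,
        score       INTEGER
    );
    INSERT INTO survey_response (response_id, school_id, score) VALUES
        (1, 10, 4),
        (2, 10, NULL),
        (3, NULL, 5),
        (4, 11, 3);

    -- Only complete rows feed the summary worksheet.
    CREATE VIEW valid_responses AS
        SELECT * FROM survey_response
        WHERE school_id IS NOT NULL AND score IS NOT NULL;

    -- Rows with missing values go to a separate tracking worksheet.
    CREATE VIEW invalid_responses AS
        SELECT * FROM survey_response
        WHERE school_id IS NULL OR score IS NULL;
""")

valid_rows = conn.execute(
    "SELECT response_id FROM valid_responses ORDER BY response_id"
).fetchall()
invalid_rows = conn.execute(
    "SELECT response_id FROM invalid_responses ORDER BY response_id"
).fetchall()
print(valid_rows, invalid_rows)
```

Because the logic lives in the views, a refresh picks up newly fixed rows automatically: once a missing value is filled in, the row moves from the invalid view to the valid one.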

Tracking the Progress of Project Goals
For a large-scale educational survey research project, we created SQL views to monitor the progress of our survey invitations, issued incentives, and overall participation rates. This was especially useful since we were simultaneously recruiting schools to participate, sending email invitations, overseeing the survey registration and administration process, and issuing gift certificates. Automating our Excel reports made it easy to monitor these different stages of the research project, and evaluate the progress of project goals.

Visualizing the Data

When writing the code for your views, create a series of tables, each of which is a potential flat file that a researcher could easily import into a data analysis program such as SPSS or SAS. Once you create your views and you have created your worksheets containing source data, you can begin to create more accessible summary data, utilizing the Pivot Table and graphing functionalities embedded in Excel. These visuals can be used for internal monitoring or to provide regular updates summarizing progress to partner organizations. In the next posting, I will go into more detail about how to use SQL views to create an automated report that will provide you with run-time data.

- Michelle

On the Nature of Data, Part I

Daniel John Wilson

Hi, I’m Daniel, and I am a data junkie.

For the last twelve years of my professional, academic, and even personal life, I have been playing with data for the purpose of analyzing it. Data, as a term, is often used ambiguously to refer to many types of information.

This is the first in a short series of postings where I will discuss some of the issues regarding using data, discussing data, and engineering data solutions, especially as it pertains to the assessment universe. Data can mean many things to many people. In order to understand this confusion, it helps me to root myself in the most basic meaning of the word. The word data translates to mean things given.

This definition of data corresponds to the systems model (commonly referred to as DIKW, for Data-Information-Knowledge-Wisdom), where data represents raw input without any relationships or organization.

[Figure: Data-Information-Knowledge-Wisdom (DIKW) Model]

A data table is a set of data points and their relationships with respect to the primary key that identifies each tuple, or row, of data. But in the DIKW model, a data table like this represents information, because each data point is now connected to other pieces of data: the table provides a value for every attribute associated with each row.

Ideally, a data table should contain:

1. A primary key field that uniquely identifies each tuple in each table.
2. A set of attributes and values associated with the individual tuple.
3. A set of keys from related data tables that allow for easily joining the information across tables.

Thus, a database is a collection of these kinds of data tables and their relationships.

Each data table in an assessment database will likely represent a different level of a hierarchy. In the assessment-verse, the lowest level of data tables in a database is the item response level. In this data table, each row represents a response to a single item within a single test session. Optimally, only information specifically describing a single item response (e.g., response to the item, item score (correct/incorrect), item response time) would be contained in this data table. Information that pertains generically to objects at a higher level within the item response data table should form the basis of additional tables.

Each item response record should be linked to a data table that stores item specification information, where each row represents an item on an exam. Information specific to the item on the exam (e.g., answer key, mappings to a content area) would then be stored here and not in each item response record. Likewise, each item response record should be linked to a data table that stores test result information, where each row represents the result of a test session. This data table would hold information specific to an individual test result (e.g., final score, pass/fail, total administration time).
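The three-level layout described above can be sketched as a small schema. This is only an illustration using Python's built-in sqlite3 module; every table and column name here is an assumption, not a prescribed standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per item on the exam.
    CREATE TABLE item (
        item_id      INTEGER PRIMARY KEY,
        answer_key   TEXT,
        content_area TEXT
    );
    -- One row per test session result.
    CREATE TABLE test_result (
        test_result_id INTEGER PRIMARY KEY,
        final_score    INTEGER,
        passed         INTEGER,
        total_seconds  INTEGER
    );
    -- Lowest level: one row per response to one item in one session.
    CREATE TABLE item_response (
        item_response_id INTEGER PRIMARY KEY,
        test_result_id   INTEGER REFERENCES test_result(test_result_id),
        item_id          INTEGER REFERENCES item(item_id),
        response         TEXT,
        is_correct       INTEGER,
        response_seconds INTEGER
    );
    INSERT INTO item VALUES (1, 'B', 'Algebra'), (2, 'D', 'Geometry');
    INSERT INTO test_result VALUES (100, 1, 0, 95);
    INSERT INTO item_response VALUES
        (1000, 100, 1, 'B', 1, 40),
        (1001, 100, 2, 'A', 0, 55);
""")

# The foreign keys make it easy to join information across levels.
rows = conn.execute("""
    SELECT r.item_response_id, i.content_area, r.is_correct, t.final_score
    FROM item_response AS r
    JOIN item AS i ON i.item_id = r.item_id
    JOIN test_result AS t ON t.test_result_id = r.test_result_id
    ORDER BY r.item_response_id
""").fetchall()
print(rows)
```

Note that the answer key and the final score are each stored once, in their own tables, rather than being repeated on every item response row.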

In our next installment of On the Nature of Data, we will dive deeper into the DIKW model and explore how data normalization is vital to the health of an assessment database. DATA FATA SECULTUS!

-dj

U.S. Supreme Court Rules in Favor of Valid Tests

Brian D. Bontempo, PhD.

There have not been too many court cases related to the testing industry that are commonly cited as landmark cases. It is for this reason that the recent decision by the U.S. Supreme Court in the Ricci et al. v. DeStefano et al. case may be cited for years to come. In this case, the Supreme Court ruled in favor of some New Haven, CT firefighters who met the qualifications for promotion to Captain or Lieutenant based on test results but were denied the promotion. After analyzing the test results, the test sponsor (the city of New Haven, CT) discovered that the white candidates had performed better than the minority candidates and decided to use an alternative method to determine who would be awarded promotion. It was revealed during the case that the exam was constructed using the results of a job task analysis, the items were written and validated by qualified subject matter experts, and the scoring model was developed in a scientifically sound manner. From the information provided, it seems as though the exam was psychometrically sound. Therefore, it is monumental that the Supreme Court supported the use of the test results despite the poor performance of minority candidates.

One thing that I find quite interesting about this case is that no one considered or conducted an empirical study to determine if the test was, in fact, biased. If you’re not an advanced test maker, then you may be saying, why would this be necessary? It’s clear that the minority candidates didn’t do as well as the white candidates; doesn’t that clearly demonstrate that the test is biased? The answer to that question is an emphatic No. I’ll illustrate why through the following example. Let’s imagine a classroom teacher who separated his/her class into two groups based on her/his perception of the students’ ability. One group was comprised of the high ability students and the other of the low ability students. The teacher then administered a test to the two groups. The results came back as expected. The group that the teacher perceived to be high ability outperformed the low group. Would we conclude that the test is biased? No.

So what can we do to determine if low performance is due to a biased test?

Modern psychometric techniques allow us to use the item level test data to determine if there are any test items (questions) that are biased. These techniques, called Differential Item Functioning (DIF), allow test makers and test users to objectively identify items that are more difficult for one subpopulation than another. These analyses are more in-depth than simply comparing the percentage of majority candidates that answered a question correctly to the percentage of minority candidates that did so. Although there are several different approaches to DIF analyses, the two most common are as follows:

  • The Classical Test Theory approach. This technique separates the candidates from each subpopulation into groups based on their ability and compares the performance of the majority/minority group at each ability level. See the table below. In this situation, the percentage of majority candidates answering the question correctly was 76% compared to 70% for the minority. However, when you look at the question by ability level, it seems as if the two groups are performing pretty similarly.

    [Figure: DIF-item1]

    Contrast that to a different test item where the overall percentage of candidates answering the item correctly was again 76% for majority and 70% for the minority.  The difference between the Majority and Minority percent correct is about 5% for each of the ability groupings.  Quite obviously, there is something distinctly biased about this test item.

    [Figure: DIF-Item2]

  • The IRT approach. Although somewhat more difficult to understand, this approach is actually easier to calculate. To conduct a DIF analysis with IRT, simply calibrate the test twice, once with each subpopulation, and compare the item difficulty estimates. I recommend comparing the results by plotting the item difficulty estimates derived from one subpopulation against the other. The result is typically a fairly straight diagonal line. Biased items appear away from that line.
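The Classical Test Theory comparison in the first bullet can be sketched in a few lines. This is a bare-bones illustration with made-up records, not a full DIF procedure (operational analyses typically add a significance test, which is not shown here). The group labels, stratum labels, and function names are all assumptions.

```python
from collections import defaultdict


def percent_correct_by_stratum(records):
    """records: iterable of (group, ability_stratum, item_score), item_score 1/0.

    Returns {stratum: {group: proportion correct}}.
    """
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, attempted]
    for group, stratum, score in records:
        cell = totals[stratum][group]
        cell[0] += score
        cell[1] += 1
    return {
        stratum: {group: correct / attempted
                  for group, (correct, attempted) in groups.items()}
        for stratum, groups in totals.items()
    }


def stratum_gaps(records, reference="majority", focal="minority"):
    """Difference in proportion correct (reference - focal) within each stratum."""
    table = percent_correct_by_stratum(records)
    return {
        stratum: groups[reference] - groups[focal]
        for stratum, groups in table.items()
        if reference in groups and focal in groups
    }


# Made-up data for one item: the majority group has more high-ability examinees.
records = (
    [("majority", "low", s) for s in (1, 0)]
    + [("majority", "high", s) for s in (1, 1, 1, 0, 1, 1, 1, 0)]
    + [("minority", "low", s) for s in (1, 0, 1, 0)]
    + [("minority", "high", s) for s in (1, 1, 1, 0)]
)
gaps = stratum_gaps(records)
print(gaps)
```

In this toy data set the overall percent correct is higher for the majority group (70% versus 62.5%), yet the gap inside each ability stratum is zero, which is the pattern of an unbiased item.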

Both of these techniques work quite well at identifying items that are biased. Once identified, test makers should review, revise and/or remove items that are showing signs of bias. If there are a large number of these items, then this is a sign that the test itself is biased and should be overhauled completely.
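For the IRT approach described above, the plot can be backed up with a simple numeric check. The sketch below assumes you already have item difficulty estimates (in logits) from two separate calibrations; it does not perform the calibration itself. Centering each set of estimates and the 0.5 logit flagging threshold are our assumptions, not a published rule.

```python
def flag_dif_items(difficulty_group_a, difficulty_group_b, threshold=0.5):
    """Flag items whose difficulty differs between two calibrations.

    Both arguments map item_id -> difficulty estimate (logits).
    Each set of estimates is centered on its own mean so the two
    calibrations share a common origin before they are compared.
    """
    common = sorted(set(difficulty_group_a) & set(difficulty_group_b))
    mean_a = sum(difficulty_group_a[i] for i in common) / len(common)
    mean_b = sum(difficulty_group_b[i] for i in common) / len(common)
    flagged = []
    for item_id in common:
        gap = (difficulty_group_a[item_id] - mean_a) - (difficulty_group_b[item_id] - mean_b)
        if abs(gap) > threshold:
            flagged.append(item_id)
    return flagged


# Hypothetical estimates: Q4 is much harder for group B than for group A.
group_a = {"Q1": -1.0, "Q2": 0.0, "Q3": 1.0, "Q4": 0.2}
group_b = {"Q1": -0.9, "Q2": 0.1, "Q3": 1.1, "Q4": 1.3}
flagged = flag_dif_items(group_a, group_b)
print(flagged)
```

Items that come back flagged are the ones that would sit away from the diagonal in the plot, and they are the ones to send to subject matter experts for review.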

In conclusion, if you are ever faced with a potential test bias issue, run a DIF analysis before taking your next step. In the case of the firefighters, this information may have helped the city of New Haven to defend its assessment program against test bias criticism.

B…

The Point-Biserial Correlation Coefficient

Brian D. Bontempo, PhD.

One common metric used to assess item quality is the point biserial correlation coefficient (rpb). The “pt bis” as it is sometimes called is the correlation between an item score (1/0) and the total score on a test. Positive values are desirable and indicate that the item is good at differentiating between high ability and low ability examinees.

As straightforward as the concept may seem, there are a handful of different flavors of the point biserial. The most basic flavor relates the examinees’ item scores with their total RAW scores on the test. A slightly more robust version of this, called the corrected point biserial, calculates the relationship between the item score and the test score after removing the item in question from the total score. This flavor is very important for short tests where one item can dramatically impact the total score. Another flavor is the point-to-measure correlation, which correlates the item score with the IRT ability estimate rather than the raw score. The point-to-measure correlation can come in both corrected and uncorrected forms.
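The first two flavors are easy to compute by hand. Here is a minimal sketch in plain Python; the tiny response matrix is made up, and the function names are ours. The point biserial is simply a Pearson correlation in which one variable happens to be scored 1/0.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def point_biserials(responses, item_index):
    """Return (uncorrected, corrected) point biserials for one item.

    responses: list of rows, one per examinee, each a list of 1/0 item scores.
    The corrected version removes the item from the total before correlating.
    """
    item_scores = [row[item_index] for row in responses]
    raw_totals = [sum(row) for row in responses]
    rest_totals = [total - score for total, score in zip(raw_totals, item_scores)]
    return pearson(item_scores, raw_totals), pearson(item_scores, rest_totals)


# Five hypothetical examinees on a four-item test.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
uncorrected, corrected = point_biserials(responses, item_index=0)
print(round(uncorrected, 3), round(corrected, 3))
```

On a test this short the item makes up a quarter of the total score, so the uncorrected value is noticeably inflated relative to the corrected one.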

Since test makers rely very heavily on the pt bis as a measure of item quality, it is important to remember that the point-biserial is greatly affected by restriction of range issues in the total test score (or measure). That is, when the total scores do not vary greatly, then the value of the point biserials is smaller than the actual relationship. This is most noticeable in difficult items where the total scores of people who answered an item correctly tend to be quite similar (in fact, they are often close to perfect scores).

Given that the size of a point biserial is greatly impacted by the variance in the scores of the examinees, it is not advisable to compare the point biserial of items across exams. A point biserial of .1 may be great for one test and really yucky for another. In addition, a point biserial of .1 for the hard items of a test may be great while it may be really yucky for the easy items of the same test.

We recommend that test makers use the point biserial as a way of identifying items that should be reviewed by subject matter experts for potential problems. We have found that items with low point biserials tend to have at least one of the following problems:

  • A debatable correct answer
  • More than one correct answer
  • No real correct answer
  • An unclear stem (The question text)

By calculating and using the point biserial, test makers can identify problematic items. From there they can collaborate with subject matter experts to improve the quality of their tests one item at a time. Look for future posts on other item quality metrics such as item fit statistics.

B…

What do you get from a single test item?

Brian D. Bontempo, PhD.

If an examinee answers a single test item (aka question) about a specific topic correctly, what can we infer about the examinee’s knowledge of that topic?  Even with the highest quality performance assessment items, the answer is very little. That’s because there are several factors that interact when a test item is presented to an examinee such as the readability of the item, examinee motivation, or prior knowledge of the item possessed either by cheating (see Caveon for more information about cheaters) or by having a very small set of performance assessment items.

Therefore, it is useful to think about knowledge (or competency, skill, or ability) using the framework of probability. In the absence of any other information, we can infer that the probability of an examinee possessing the knowledge of a specific topic is high if the examinee answers a test item about it correctly. The corollary in corporate training would be, we can infer that the probability of an examinee possessing the skills to perform a specific task is high if the examinee performs the task correctly in an assessment situation.

Now, there are many things that can affect this probability. The fidelity of the item is certainly important. Multiple-choice questions are often bashed by test haters because their fidelity is low making it quite challenging to infer that an examinee knows something about a topic from a single response. Performance assessment items are often praised because they have a much higher level of fidelity thereby increasing the probability.

No matter what type of item is used, the probability will never reach 100%. It is this fact that makes it fundamentally necessary to approach testing from a probabilistic perspective as Item Response Theory (IRT) and Rasch Measurement Theory do. This fact also makes it imperative to collect more information or ask more test questions. Sometimes this means that test questions begin to look quite similar. I remember my 3rd grade math book where the exercises required the same computations over and over again using different numbers. Sometimes, I answered 90% of the items correctly about a topic, having missed a few items due to careless errors. My teacher was able to infer that I understood the topic in this situation. Other times, I did quite poorly, answering very few items correctly. This time the teacher easily inferred my lack of knowledge about the topic. There were some occasions where I answered a good number correctly but still missed some, answering around 75% of the items correctly. Sometimes, my teacher could inspect my responses and identify a trend, such as a specific area where I lacked the knowledge. Other times, she could not. In these situations, I typically possessed the knowledge but applied my knowledge inconsistently due to tiredness, boredom, or carelessness.
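To put rough numbers on why one item is not enough, here is a simple Bayes' rule illustration. To be clear, this is not IRT or Rasch measurement; it is a toy model, and every number in it is an assumption: a 50/50 prior, a 90% chance of a correct answer for an examinee who knows the topic (allowing for careless errors), and a 25% chance for one who does not (guessing on a four-option multiple-choice item).

```python
def posterior_knows(prior, n_correct, n_incorrect,
                    p_correct_if_knows=0.9, p_correct_if_not=0.25):
    """Probability the examinee knows the topic, given their item responses.

    Treats the responses as independent given knowledge state,
    which is a simplifying assumption.
    """
    likelihood_knows = (p_correct_if_knows ** n_correct
                        * (1 - p_correct_if_knows) ** n_incorrect)
    likelihood_not = (p_correct_if_not ** n_correct
                      * (1 - p_correct_if_not) ** n_incorrect)
    numerator = prior * likelihood_knows
    return numerator / (numerator + (1 - prior) * likelihood_not)


one_item = posterior_knows(0.5, n_correct=1, n_incorrect=0)
five_items = posterior_knows(0.5, n_correct=5, n_incorrect=0)
print(round(one_item, 3), round(five_items, 3))
```

Under these assumptions a single correct answer moves the probability to roughly 78%, while five correct answers push it above 99%. It gets close to 100% but never reaches it.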

In all three of these situations, it should be clear that the information gleaned from one test item is not enough to infer mastery of the topic. This leads us to the next logical question, how many questions should a test maker ask about a specific topic? It doesn’t really matter! What does matter is that you ask enough questions about the entire content domain to make valid inferences about the examinee’s knowledge of the entire subject rather than a specific topic. If you’re really interested in an examinee’s topic knowledge, then you’ll need to build a test for every single topic, which might look a lot like the 3rd grade math test discussed earlier where the test items were very similar. Although this type of test might seem reiteratively, recursively, repetitively, redundant (uh huh!), it will not only succeed in achieving better measurement, it will help the examinees to develop procedural memory. After all, practice makes perfect.

B…

A Criterion-Referenced Test is NOT a Mastery Test

Brian D. Bontempo, PhD.

This week’s entry is inspired by a Figure in a widely publicized book by Sharon A. Shrock and William Coscarelli called Criterion-Referenced Test Development: Technical and Legal Guidelines for Corporate Training. I have not completed reading this book, but I find the authors’ perspective to be an interesting departure from the traditional testing paradigms (which equals 20 cents for those of you keeping score). I would like to thank them for stretching my mind and inspiring me to think differently about testing in corporate settings.

I would like to take this opportunity to continue the stretching by expanding on a topic they discuss in chapter two, which defines both criterion-referenced testing (CRT) and norm-referenced testing (NRT). A norm-referenced test (NRT) is one in which test outcomes (e.g., grades or pass/fail) are determined based on each examinee’s score relative to the other examinees. Although this practice is uncommon (and arguably unethical), it is occasionally still used today. For example, some of the state Bar exams are norm-referenced. Typically, the top X percent of examinees are awarded a passing mark, regardless of how competent or incompetent the group of test takers was that took the exam together. In other words, if a prospective lawyer were to take an exam along with the most competent group of graduates, then (s)he would have less chance of achieving a passing mark than (s)he would have if (s)he took the exam alongside a group of bottom feeders. Does it surprise you that the legal profession would endorse something out-dated, scientifically unsupported and arguably unethical?

On the other hand, a criterion-referenced test (CRT) is a test composed of specific objectives, or competency statements. This type of test is common in licensure and certification. The passing rates for CRTs vary with each test cycle since examinees are evaluated based on their competency relative to a criterion-referenced passing standard (aka cutscore). There are many other attributes of these two types of tests beyond their scoring methodology, and I’ll leave it up to future posts to expand upon these.

One other type of test that Shrock and Coscarelli refer to is a mastery test, a test where most examinees answer the vast majority of the content correctly. K-12 classroom tests are commonly designed this way. The distribution of scores for a mastery test looks similar to this (Insert distribution). I think that it is important to point out that mastery tests are a form of criterion-referenced tests. In other words, every mastery test is a criterion-referenced test, but not every criterion-referenced test is a mastery test. See below for a visual representation of this.

Types of Tests

So, what do we call a non-mastery CRT? To be honest, I don’t know. I have heard people refer to them as non-mastery tests or non-mastery, criterion-referenced tests.

Mastery tests are useful in the corporate training world where the content domains are small (typically measured in class hours) and the shelf-life of the training programs and tests are generally short (measured in months or years). However, they are NOT optimal for certification (corporate or non-corporate).

Mastery Test Curve

Why should a corporation build a non-mastery, criterion-referenced test? There are two primary reasons.

  1. If constructed properly, non-mastery, criterion-referenced tests provide more information than a simple pass/fail result.  Non-mastery, criterion-referenced tests are competency measurement instruments. Just as a ruler measures the length of an object, a non-mastery, criterion-referenced test can measure the competency of an individual. This ruler can be used to measure the competency of individuals or the difficulty of the test questions, which can provide valuable feedback to the training program or corporation.
  2. When the level of mastery changes, it is much easier to change the level of competency required to achieve mastery, than it is to write new content or a whole new exam.

Item Exposure Control Mechanisms

Brian D. Bontempo, Ph.D

A few weeks ago, I had the privilege of making the annual trek to the AERA/NCME annual meeting. Typically, I’m quite cynical about this event, but this year I was pleasantly surprised. The program was well done, the presentations were pretty darn good, and the people I chatted with stimulated me. All in all, I came back pretty inspired. I’d like to give thanks to Kathy Gialluca and Daniel Bolt, the NCME program chairs and to Barbara Dodd and Chad Buckendahl who served as moderators and discussants in some of my sessions.

Today’s topic is an advanced topic and is a follow-up to the NCME session I discussed on Item Exposure Control Mechanisms in CAT (Computerized Adaptive Testing). This is a topic that is a real yawner for me because it’s typically business and operational motivations that dictate choice in item exposure rather than academic or psychometric rigor. Before diving in, here’s a very brief history of item exposure control mechanisms.

In the old days (circa 1980s), the CAT community referred to this as Item Selection, not Item Exposure Control. At the time, the algorithms (e.g., Kingsbury & Zara, 1989) all contained an element of randomization. And they all worked quite well. However, there were a number of testing programs using the Two-Parameter Logistic (2PL) or Three-Parameter Logistic (3PL) models (for more information on Item Response Theory, or IRT, see future posts) that had ‘less than stellar’ item pools. For these programs, there were a number of items that were always selected because they were stellar performers while others were never selected because they were low quality items. Somehow, these programs got the idea that they could improve their exams by doing some complex modeling and finding a way to select these items. Thus was born the Sympson-Hetter item selection algorithm (Sympson & Hetter, 1985). This method succeeds in establishing a maximum threshold at which an item cannot be selected again, but is extremely cumbersome to implement and sucks much of the efficiency out of a CAT. About a decade later, some Spanish researchers proposed the Progressive Method (Revuelta & Ponsoda, 1998). This method essentially selects items at random at the beginning of the test and slowly but surely turns into a maximum likelihood CAT by the end of the test.
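Two of the ideas above, randomization among the best items and the Progressive Method, are simple enough to sketch. What follows is our own simplified reading of them, not the published algorithms verbatim: it assumes a Rasch item pool stored as a dictionary of item_id -> difficulty, and it leaves out Sympson-Hetter entirely because that method requires simulation to set its exposure parameters. All names and values are hypothetical.

```python
import math
import random


def rasch_information(ability, difficulty):
    """Rasch item information p * (1 - p) at the current ability estimate."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)


def select_randomesque(ability, pool, k=5, rng=random):
    """Pick at random among the k most informative items in the pool."""
    ranked = sorted(pool, key=lambda item_id: rasch_information(ability, pool[item_id]),
                    reverse=True)
    return rng.choice(ranked[:k])


def select_progressive(ability, pool, items_given, test_length, rng=random):
    """Blend a random component with item information.

    The weight on information grows with the share of the test already
    administered, so early selections are nearly random and late
    selections are nearly maximum information.
    """
    s = items_given / test_length
    info = {item_id: rasch_information(ability, b) for item_id, b in pool.items()}
    max_info = max(info.values())
    weights = {item_id: (1 - s) * rng.uniform(0, max_info) + s * value
               for item_id, value in info.items()}
    return max(weights, key=weights.get)


# Hypothetical pool of unused items: item_id -> difficulty in logits.
pool = {"A": -2.0, "B": -0.1, "C": 0.0, "D": 0.2, "E": 2.5}
early_pick = select_randomesque(0.0, pool, k=3, rng=random.Random(1))
late_pick = select_progressive(0.0, pool, items_given=10, test_length=10)
print(early_pick, late_pick)
```

In a real CAT you would remove each administered item from the pool and update the ability estimate after every response; both steps are omitted here to keep the selection logic visible.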

The item exposure session at NCME was an academic comparison of these using simulated data for small item pools. I had a good time reading the papers. I like being a discussant. It forces you to read papers you wouldn’t otherwise read. Moreover, it forces you to say something constructive about them (hah!). No offense to the authors. I have three points to make.

During my time as a discussant, I asked the NCME crowd what item control mechanism they were using, and I discovered that the Sympson-Hetter is slightly more popular than the randomization techniques and only a few folks are using the progressive method. I was thoroughly shocked by this. I expected most folks to be using randomization techniques. This tells me one thing…mathematically inclined Psychometricians have a habit of wasting their clients’ time and dollars because they enjoy mathematically modeling things that today’s computers can collect empirically just as quickly.

Point #2…Item exposure control is only necessary for programs that can’t build a good item pool. In the four papers I discussed, the authors had REALLY BAD item pools. In one, only 51 items were selected from a 176 item pool for a 40 item test. Hmmmm…why build a CAT in this situation? Two fixed forms will perform better than a CAT.

The moral of this point is that ALL testing professionals need to be concerned with building better item pools. This fixes item exposure issues more than any control mechanism available.

Point #3…I note that One Parameter Logistic Model, or 1PL (Rasch), CATs yield flatter item exposure. That’s one more reason to implement the Rasch model for testing programs moving to CAT.

B…

Welcome to Mountain Measurement’s Blog

Hi!

This site is meant to be a casual, fun, and friendly braindump of the Mountain Measurement thought farm, our term for a sustainable think tank.

© 2009-2010 Mountain Measurement, Inc. All Rights Reserved