Thoughts on the NCLEX Invitational

Mountain Measurement’s first attempt to help nursing programs interpret the information found in the NCLEX Program Reports was a total success. Look for the slide deck on the NCLEX Program Reports website in a few days.

We also received lots of wonderful ideas from users about report enhancements. The following ideas came from the crowd; after each one, I’ll offer some comments.

1.) Add information about the failing candidates – This is a really good suggestion. I envision reporting the performance of the median failing candidate. Although this is a really good idea, the data may render this information unusable. That is, many programs have so few failing candidates that it would be impossible for us to report stable statistics, and these small samples are more susceptible to extreme scores as well. Regardless, we’ll certainly look into this.

2.) Provide information about repeat test takers – This idea will have some challenges but is certainly noteworthy. The current NCLEX Program Reports only include information about first-time test takers. A separate section on repeat test takers might allow programs to understand how they could help their most challenged students pass the exam. As with the suggestion above, the fact that there are very few repeat test takers per program may limit its usefulness. Still, we won’t know until we investigate further.

3.) Report performance by quartile – This is another fantastic idea. There are so many ways that this could be accomplished that we’ll need to spend some time playing around with this idea before we get it right.

4.) Number of days between graduation and testing – This is another good idea, although this one seems far more difficult to implement. The first hurdle will be collecting graduation dates, since there is currently no mechanism for collecting data from the schools. In addition, some schools now test students before graduation, while other schools have a rolling graduation.

Many thanks to the individuals who made these suggestions. If you have any ideas, please contact Mountain Measurement at your convenience.

Mountain Measurement to present at the NCLEX Invitational in Atlanta, GA

Daniel and I board the plane for Atlanta in about 6 hours. We’re pretty pumped about having the opportunity to chat with the nursing community about the NCLEX Program Reports. We’re hoping to raise awareness about the program and increase the knowledge and interpretive skills of current users. If you have an interest in nursing program evaluation or accreditation, then come join us at 3:15 in Centennial II.

We’ll also be stationed at the Mountain Measurement booth ready to register new users or take orders for the new 2010-2011 NCLEX Program Reports. There are some exciting changes on the horizon, so come see us to find out more!

ETS goes Solar

Thank you ETS for doing this.  Full article found

ACT, any chance that you’re next?

Measurement Art

The IACAT conference is being held at the Papendal in Arnhem, Netherlands. This is the place where Dutch Olympians train. On the wall, there are some amazing pieces of art that depict life-size renditions of world-record Olympic feats. Here are some photos of them.

Bob Beamon’s 1968 World Record Long Jump. This was almost 2 ft longer than the prior record.

Bob Beamon's Long Jump

Javier Sotomayor’s World Record High Jump. Note the stairwell railing, which provides some perspective on how high this jump really was.

Javier Sotomayor's World Record High Jump

IACAT Day #1

Today, I had the distinct privilege of attending and presenting at the first-ever meeting of the International Association for Computerized Adaptive Testing.  Wow!  The diversity of theoreticians, practitioners, and supporters of CAT around the globe is fantastic.  We had folks from 30 different countries in attendance.  This is the logical extension of the CAT conferences put together by GMAC and David Weiss, which have occurred sporadically since the ’70s.  It’s great to see a formal entity in existence to promote CAT and to facilitate the sharing of knowledge that is taking place today.

I would like to take this opportunity to comment on a few of the great presentations made today.

1.) Mark Reckase gave one of the keynote addresses.  As always, Mark gave a fantastic speech. He asserted that the paradigm of testing has not yet fully shifted from True Score Theory (aka Classical Test Theory) to Item Response Theory.  He claimed that IRT would not claim hegemony until we develop universal scales for some commonly used constructs like mathematical literacy.  In essence, measurement is not yet as real as meteorology because the weather nuts have universal metrics for temperature, humidity, etc., and psychometrics does not. I don’t know if I agree with his statement, but I do agree that there should be one and only one measurement scale for our basic K12 constructs.  I don’t care how you cut it, Algebra is Algebra.  There’s no difference between Algebra knowledge in Alabama and Algebra knowledge in Maine.  Sure, Alabama and Maine may teach totally different topics from totally different perspectives, but when it comes right down to it, a student’s achievement level and academic growth in Algebra should be measured universally.  In essence, an inch is an inch.  Don’t let any politician, teacher, qualitative researcher, or anti-testing hippy tell you otherwise.  The real issue will be finding a way to create these scales as open-source scales that can be used by all testing and education professionals rather than being owned by a single testing company.

2.) Wim van der Linden gave the other keynote address, and like Mark’s, it was also enjoyable. My main takeaway from Wim’s talk was the call for even more efficiency in testing than we have currently.  The first CATs cut the number of items needed by 50%. The next wave of CATs may make it possible for us to assess constructs accurately with 50% fewer questions than we use today. This is a great call to action and opens the door for many possibilities in formative assessment.

3.) Barth Riley gave a great paper presentation entitled “CAT item selection and person fit: Predictive efficiency and detection of atypical symptom profiles.” I really enjoyed this session because it exposed me to Linacre’s Bayesian Maximum Falsification (BMF) approach to item selection in CAT. He used this approach as a way to increase the validity of common person fit statistics.  Barth evaluated BMF across the whole spectrum of person fit stats including

  • infit (Wright and Stone, 1979)
  • outfit (Wright and Stone, 1979)
  • Likelihood Z (Drasgow, Levine, & Williams, 1985)
  • Modified Caution Index (Harnisch & Linn, 1981)
  • H

Some of these were new to me, so I appreciated the introduction. In a maximum information CAT (a traditional CAT), the probability of a correct response to an item is typically 50%, which greatly limits the utility of person fit statistics. By implementing the BMF approach, the probability changes, thereby increasing the viability of the fit stats.  He got good results.
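That 50% figure follows from the item response model itself: under the Rasch model, when an item’s difficulty matches the examinee’s current ability estimate, the probability of a correct response is exactly one half. A quick sketch in Python (the ability and difficulty values are hypothetical):

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability of a correct response given
    ability theta and item difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A maximum information CAT selects items with b near theta,
# so the expected probability of success hovers around 0.50.
print(rasch_p(1.3, 1.3))   # exactly 0.5 when theta == b
print(rasch_p(1.3, 0.3))   # an easier item: well above 0.5
```

Selecting items away from that 50% point, as BMF does, is what restores contrast for the fit statistics.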

4.) I’ll be posting slides from our presentation on “The theoretical issues that are important to operational adaptive testing” at some point in the near future on the MM website.  Enjoy.

I’m off to dinner.  Maybe I’ll take some photos and upload them here.

Spring Conferences

OK, so it’s not April or May anymore.  Yes, it’s a little late to be providing a review of the Spring conferences put on by the International Objective Measurement Workshop and the National Council on Measurement in Education. Better late than never.

IOMW is the biennial Rasch modeling conference.  I attended a few times in the ’90s but have not been back in 10 years. I stopped going because the conference contained a great deal of Rasch evangelizing and seemed to lack relevance to practitioners in the licensure and certification arena.

I am pleased to report that the 2010 IOMW was a breath of fresh air.  Here is a quick review of the sessions that caught my attention.

1.) The Rasch testlet model – The term testlet is used very differently here than it traditionally has been in the Computerized Adaptive Testing arena.  Here it refers to a subtest which is modeled as a separate dimension in a multidimensional framework. For more information, see Wang, W., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29. I don’t know enough about this model to comment, but I will be curious to see how it differs from the MIRT (multidimensional IRT) model that Mark Reckase has been promoting for years.  The skeptic in me is still doubtful of this model.  I have not seen a multidimensional model replicated successfully using REAL-world data.  In essence, multidimensional modeling is like cold fusion for measurement.

2.) A subdimension model – Steffen Brandt proposed a new model in which the multidimensional components are modeled as subdimensions rather than as separate dimensions.  This means that the primary ability estimate is no different than it would be in a unidimensional framework, but the standard error of measurement (SEM) is.  Again, this is still too new for me to comment on in depth, but the theoretical premise of a subdimension model jibes with me.

3.) Multiple Test Form Linking Quality – When we link two test forms using common item linking, the quality of the link can vary greatly.  It will depend on the number of items in common, the distribution of the difficulty of the common items, and the breadth of content that those items cover. Although the quality varies, never have I seen anyone try to quantify it. Mary Garner gave an excellent presentation in which she used paired comparison matrices to analyze the connectivity of the forms and the impact that connectivity has on parameter estimation. Mountain Measurement will certainly be investigating this further and will definitely be searching for better ways to report linking quality on our technical reports.

4.) Person Fit indices – Sébastien Béland taught me about Snijders’ correction to Drasgow’s likelihood Z (Lz) fit stat. This stat is a good complement to the infit and outfit stats produced by Winsteps.  Thanks to Richard Smith and his cohort of folks at the DRC for reminding me that every fit stat has its purpose and place.  No one fit stat can tell us everything about an item, and given the relative nature of fit statistics, absolute thresholds are not a reality. Moreover, the measurement community has only just begun to use the residual matrix effectively. Look for more in the future from Mountain Measurement about this.

5.) Lastly, a big thanks to David Irribarra for his presentation on modified Wright maps. The Rasch community is notorious for data visualizations that suck. David’s work is a stark contrast: colorful, beautiful visualizations that really convey information. This was a welcome sight and complements Mountain Measurement’s mission with the TRANSOM project quite nicely. Maybe we can find a way to work these in.

All in all, there are some impressive new folks in the Rasch community these days doing fantastic work. And the old guard, including David Andrich, George Engelhard, and Mark Wilson, continues to impress me. It has been 10 years since Ben Wright’s stroke, and it is apparent that the next generation has stepped up to the plate. This generation presents more information and less rhetoric and doesn’t have an axe to grind. I enjoyed myself. For those Raschees who fled, I encourage you to come back. This group is totally different than it was 10 years ago.

ATP Conference Preview

Brian D. Bontempo, PhD.

Greetings!  It’s that wonderful time of the year, when testing professionals from around the world come together to commune, celebrate warm weather, and engage in show and tell.  What am I talking about?  The annual Innovations in Testing Conference sponsored by the Association of Test Publishers.  Newbies to testing are encouraged to attend, as are seasoned veterans.  I have been going to the ATP conference since its second year, and I find it to be the highlight of the year.

I do recommend every session in the program, even the ones on Marketing (which blew me away a few years back)!  Nonetheless, I took a few moments to take a deeper look at the conference program.  Here are some of my session recommendations.

1.) Putting the Social Back in Networking (No link available) – Tuesday, Feb 9, 2010 at 3:45 PM – This should be a fantastic opportunity to practice face-to-face social networking by mimicking the online tools that many of us use today.  I believe in the power of community, and I would like to see more activity on the part of my fellow testing community members to tweet, facebook, and link-in.  This session will provide inspiration for those who are nervous.

2.) The keynote address – I have read a little online about Scott Berkun and this guy has some great things to say about innovation.  Since the word innovation is in the title of the conference, I have high hopes for this session.

3.) Evaluating and Charting a Path for Innovation – This is a session in which I am presenting so consider yourself invited.  This session will provide participants with the time to think about and discuss their own innovation projects.  For anyone contemplating or undertaking enterprise level innovation, this session should provide the inspiration and space to formulate or organize your thoughts and action plan.

4.) Assessing the Hard Stuff with Innovative Items – There are a few new innovative item bank and item type providers out there right now.  This is a presentation by one of those groups, Atvantus.  Although it has been about a year since I extensively toured the item banks that are available, this group seems to be producing more advanced item types than many of the other groups.  If you’re into innovative items, this would be a good opportunity to see what they have to say.

5.) Automating the Item Review Process – This session covers entirely new ground on a topic that I am thinking more and more about each day.  The ground is natural language processing, and the topic is item enemies.  Kirk’s a bright guy.  If you can keep up with him, I’m sure you’ll get a lot from this session.

I am also pretty excited by how much technological innovation is finally coming to testing.  It’s not really a surprise that it has taken over twenty years for it to get here.  Twenty years ago, the bucks were in consumer and business technology.  With the slowdown in the economy, the testing industry has shone as a growing industry.  No wonder the technology providers are descending on our community.  Although I am very excited about this influx of talented people and the services they provide, I must caution all not to be overzealous about cool technology that may have little psychometric rigor to back it up.  A good test or item should have both face validity and psychometric soundness.  The ATP conference should provide a great forum for people interested in testing to learn more about these new technologies and the ways in which they should be implemented to improve your assessment program.  Enjoy!

Why would a long test have a low reliability?

Brian D. Bontempo, PhD.

Introduction: Recently, we ran a reliability analysis for a client that is worth sharing. The certification exam had about 400 items, and the reliability came in under .80 as measured by Cronbach’s Alpha (Cronbach, 1951), an undesirable if not unacceptable reliability for any test, let alone a test containing 400 items. (NOTE: Mountain Measurement didn’t design or build this exam; we simply analyzed the data in an effort to begin helping the client.)

Background: Reliability is a psychometric indicator of test quality. There is a direct relationship between test length (number of items) and reliability. Many 50-60 item tests achieve a reliability that exceeds .80. Given that this exam contained hundreds of items, we set out to determine the cause of the low reliability.
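The length-reliability relationship is captured by the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened (or shortened) by a factor k with comparable items. A quick sketch, with hypothetical numbers:

```python
def spearman_brown(reliability, k):
    """Predicted reliability when a test is lengthened by a factor
    of k using items comparable to the originals."""
    return (k * reliability) / (1 + (k - 1) * reliability)

# Doubling a 50-item test that already hits .80 should push it near .89 --
# which is why a 400-item exam sitting under .80 is such a red flag.
print(spearman_brown(0.80, 2))
```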

Possible Cause #1- Low Item Quality: We calculated the item-total score correlation and discovered that the vast majority of the items had positive point-biserial correlation coefficients, indicating that the items were functioning well.

Possible Cause #2- Misalignment between the difficulty of the test items (questions) and the candidates taking the exam: We examined the distribution of test scores, and here we found some problems. We discovered that the mean raw score was quite high (about 325) and that the standard deviation was quite small (around 15). The scores were so high that no one scored below 200.

When designing a test, the items should be targeted to the examinees. This maximizes the efficiency of the test. For our client, the 200 easiest items of this exam were adding absolutely nothing to the assessment. They were simply wasting the examinees’ time. Moreover, based on the standard deviation, there were only about 60 items providing measurement information. All the others were really too easy or too hard to provide much information at all. In essence, this was a 60-item test, which explains why the reliability was around .80.
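To see why off-target items add nothing, one can compute Cronbach’s alpha directly: items that nearly everyone answers correctly contribute no variance, so padding a test with them leaves alpha flat (in fact, it nudges it down slightly). A toy sketch with made-up 0/1 score data:

```python
def cronbach_alpha(scores):
    """Cronbach's (1951) alpha. scores: one list of 0/1 item
    scores per examinee."""
    k = len(scores[0])
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = sum(var([row[i] for row in scores]) for i in range(k))
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

core = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0],
        [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
# Pad the test with three items that every examinee answers correctly:
padded = [row + [1, 1, 1] for row in core]
print(cronbach_alpha(core))    # ~0.667
print(cronbach_alpha(padded))  # ~0.583: the easy items added nothing
```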

Recommendation: In order to raise the reliability of this exam, we advised the client to write more items that would be similar in difficulty to the 60 items that were best targeted to the examinees based on their empirical performance.

Automating Reports Using SQL and Excel

Dr. Michelle Amoruso

If you need to provide data summaries on a weekly or daily basis, you might want to consider using Excel to query results directly from your database. This will allow you to produce and update standardized reports in seconds, with the flexibility to make rapid report modifications (well, as fast as you can edit your SQL code). Sure, you can save your existing SQL queries and rerun them every time you need to produce a report. But why not save time and eliminate human error? Additionally, automating your report production makes it easy to provide multiple levels of data that can be used to identify problems, track resolution, and monitor project goals.

Eliminating Human Error
Instead of pasting in data from your SQL results viewer, why not connect directly to your database and remove human error entirely? Excel will connect to your SQL platform, query all the data from a designated view, and insert it into a worksheet. Once you have established this connection, you will just need to refresh the data to update your spreadsheet.

Monitoring the Data Collection Process
You can use SQL views to identify null values in your database, making it easy to track and resolve missing data. You can also design your views to exclude invalid data, to ensure that only valid data is included in the data summaries. Conversely, you can create a view that includes only invalid data, creating a separate worksheet to flag and track invalid cases.
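The same pattern works in any SQL engine. Here is a minimal sketch using Python’s built-in SQLite, with a hypothetical responses table, one view that tracks missing scores and one restricted to valid rows (the 0-100 validity rule is made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE responses (student TEXT, score INTEGER);
    INSERT INTO responses VALUES
        ('A', 87), ('B', NULL), ('C', 42), ('D', 999), ('E', NULL);

    -- Track missing data as it comes in:
    CREATE VIEW missing_scores AS
        SELECT student FROM responses WHERE score IS NULL;

    -- Only valid rows reach the data summaries:
    CREATE VIEW valid_responses AS
        SELECT * FROM responses WHERE score BETWEEN 0 AND 100;
""")

print(con.execute("SELECT COUNT(*) FROM missing_scores").fetchone()[0])   # 2
print(con.execute("SELECT COUNT(*) FROM valid_responses").fetchone()[0])  # 2
```

Excel (or any reporting front end) then queries the views rather than the raw table, so the validity rules live in one place.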

Tracking the Progress of Project Goals
For a large-scale educational survey research project, we created SQL views to monitor the progress of our survey invitations, issued incentives, and overall participation rates. This was especially useful since we were simultaneously recruiting schools to participate, sending email invitations, overseeing the survey registration and administration process, and issuing gift certificates. Automating our Excel reports made it easy to monitor these different stages of the research project, and evaluate the progress of project goals.

Visualizing the Data
When writing the code for your views, create a series of tables, each of which is a potential flat file that a researcher could easily import into a data analysis program such as SPSS or SAS. Once you have created your views and the worksheets containing source data, you can begin to create more accessible summary data using the Pivot Table and graphing functionality embedded in Excel. These visuals can be used for internal monitoring or to provide regular updates summarizing progress to partner organizations. In the next posting, I will go into more detail about how to use SQL views to create an automated report that will provide you with run-time data.

– Michelle

On the Nature of Data, Part I

Daniel John Wilson

Hi, I’m Daniel, and I am a data junkie.

For the last twelve years of my professional, academic, and even personal life, I have been playing with data for the purpose of analyzing it. Data, as a term, is often used ambiguously to refer to many types of information.

This is the first in a short series of postings in which I will discuss some of the issues around using data, discussing data, and engineering data solutions, especially as they pertain to the assessment universe. Data can mean many things to many people. In order to cut through this confusion, it helps me to root myself in the most basic meaning of the word: data translates to “things given.”

This definition of data corresponds to the systems model (commonly referred to as DIKW, for Data-Information-Knowledge-Wisdom), where data represents raw input without any relationships or organization.

Data-Information-Knowledge-Wisdom Model

A data table is a set of data points and their relationships with respect to the primary key that identifies each tuple, or row, of data. In the DIKW model, however, a data table like this represents information, because the data has now been given connections to other pieces of data, providing values for each attribute associated with each tuple.

Ideally, a data table should contain:

1. A primary key field that uniquely identifies each tuple in each table.
2. A set of attributes and values associated with the individual tuple.
3. A set of keys from related data tables that allow for easily joining the information across tables.

Thus, a database is a collection of these kinds of data tables and their relationships.

Each data table in an assessment database will likely represent a different level of a hierarchy. In the assessment-verse, the lowest level of data table in a database is the item response level. In this data table, each row represents a response to a single item within a single test session. Optimally, only information specifically describing a single item response (e.g., the response to the item, the item score (correct/incorrect), the item response time) would be contained in this data table. Information that pertains to objects at a higher level of the hierarchy should form the basis of additional tables.

Each item response record should be linked to a data table that stores item specification information, where each row represents an item on an exam. Information specific to the item on the exam (e.g., answer key, mappings to a content area) would then be stored here and not in each item response record. Likewise, each item response record should be linked to a data table that stores test result information, where each row represents the result of a test session. This data table would hold records specific to an individual test result (e.g., final score, pass/fail, total administration time).
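A minimal sketch of that three-level hierarchy in SQL (SQLite via Python; every table and column name here is illustrative, not a prescribed schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE item (
        item_id      INTEGER PRIMARY KEY,
        answer_key   TEXT,
        content_area TEXT);

    CREATE TABLE test_result (
        result_id   INTEGER PRIMARY KEY,
        final_score INTEGER,
        passed      INTEGER);

    -- Lowest level: one row per item response within one test session.
    CREATE TABLE item_response (
        response_id   INTEGER PRIMARY KEY,
        result_id     INTEGER REFERENCES test_result(result_id),
        item_id       INTEGER REFERENCES item(item_id),
        response      TEXT,
        score         INTEGER,
        response_time REAL);
""")
con.execute("INSERT INTO item VALUES (1, 'B', 'Algebra')")
con.execute("INSERT INTO test_result VALUES (10, 325, 1)")
con.execute("INSERT INTO item_response VALUES (100, 10, 1, 'B', 1, 41.2)")

# The keys make joining information across the hierarchy trivial:
row = con.execute("""
    SELECT i.content_area, r.score, t.passed
    FROM item_response r
    JOIN item i ON i.item_id = r.item_id
    JOIN test_result t ON t.result_id = r.result_id
""").fetchone()
print(row)  # ('Algebra', 1, 1)
```

Because the answer key lives once in the item table rather than in every response row, updating it never requires touching the response data.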

In our next installment of On the Nature of Data, we will dive deeper into the DIKW model and explore how data normalization is vital to the health of an assessment database. DATA FATA SECUTUS!


© 2009-2021 Mountain Measurement, Inc. All Rights Reserved