Interpreting Test Results

“Statistics is the science of producing unreliable facts from reliable figures.” (Evan Esar)

Test development is a process that continues even after a test is administered. In fact, post hoc test analysis is a crucial aspect of the process. One advantage of selected-response exams is that item and test analysis data can be generated from the test results. Multiple-choice questions are particularly amenable to data analysis. These data reports provide valuable information that assists the teacher in assigning fair scores and improving individual items for future use. Test analysis has three goals: to identify whether any of the questions are flawed, to correct any errors and adjust the raw scores, and to improve the items for future use.

Qualitative and quantitative test reviews are equally important; they complement each other. Once you have the statistical data for a test, you can look at the items from a much more objective viewpoint. Inevitably, you and your colleagues who reviewed the exam before it was administered will be surprised by the results of at least 10% of the new items on a test, even though you followed test development guidelines. Sometimes, student responses to even the most expertly written questions are just unpredictable.

Consider the time involved in item writing and revision as an investment. Multiple-choice items can be analyzed, revised, and banked, then polished over time and adapted for reuse on future tests. In fact, the more you refine your items based on data, the better your tests become. Often, the qualitative student review, as discussed in Chapter 9, “Assembling, Administering, and Scoring a Test,” explains the statistical results of an item and provides suggestions for revising the item for future use. Reviewing the quantitative data not only provides an invaluable tool for making objective decisions about individual test items and overall test scores but also helps you use your time efficiently to improve your questions and develop a valuable testing resource: an item bank. Keep in mind that the more items you analyze, the more proficient you become at writing and identifying high-quality test items that you can bank and use repeatedly. Therefore, the time you invest is time well spent.

Before the advent of reasonably priced testing software, calculating the statistical results of an exam was a task that was impractical for a classroom teacher. Today, many colleges and universities provide machine scoring with statistical reports of tests and item analyses for multiple-choice classroom exams. The aim of this chapter is to demonstrate just how valuable these data reports are as tools for test interpretation and development; without statistical analysis, you have no assurance that your tests are functioning as you intended.

Overall Test Data Analysis

Most test development software packages provide two levels of test analysis data: the overall analysis of the test and a detailed analysis of each item as it relates to the test as a whole. While your first consideration should be to look at the overall picture, both these data sets are essential for a thorough test analysis. Examining test data is well within your purview once you understand the meaning of each of the values. Remember, you do not have to do any actual calculating. Once you use the data, you will appreciate their value to such an extent that I guarantee you will never again assign grades to a multiple-choice exam without reviewing the statistical analysis.

When a test is scored, the initial result is reported as a raw score, or the number of items that a student answered correctly on the test. Statistical analysis assists you with transforming the raw scores into test grades. Appendix B, “Basic Test Statistics,” provides an overview of the terminology of statistical analysis. Take the time to review some basic statistical references before examining the example of a statistical test report in Table 11.1.
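To make the idea of a raw score concrete, here is a minimal Python sketch of how a scoring program might count correct answers. The function name `raw_score`, the five-item answer key, and the student responses are all invented for illustration; they do not come from any particular testing package.

```python
def raw_score(responses, answer_key):
    """Count the items a student answered correctly."""
    if len(responses) != len(answer_key):
        raise ValueError("response sheet and answer key differ in length")
    # One point for every response that matches the keyed answer.
    return sum(1 for given, correct in zip(responses, answer_key) if given == correct)


# Made-up key and responses for a five-item quiz (illustration only).
answer_key = ["B", "D", "A", "C", "A"]
responses = ["B", "D", "C", "C", "A"]

print(raw_score(responses, answer_key))  # 4
```

The student in this example missed only the third item, so the raw score is 4 out of 5.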

Table 11.1 is a sample test analysis report that contains the typical data you would receive from a testing software program. In fact, this report presents more than enough data to help you make informed decisions about test results. Some programs provide even more comprehensive statistics. It is not necessary to make your review too complicated, however; this sample data report is more than sufficient for analysis of a classroom test.

Generally, item statistics for small groups of students are relatively unstable. The stability of test analysis data increases as the number of test takers approaches 100. Therefore, when you have a very small group (50 or fewer), you must consider the relative instability of the data when you interpret the analysis. In fact, test and item analysis should not be interpreted dogmatically, no matter how large the number of students. As this discussion illustrates, analysis of test data requires a variety of interpretations, both qualitative and quantitative. The size of the sample is one of the factors you must consider.
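One way to see why small groups give unstable item statistics is to look at the sampling error of a simple proportion, such as the share of students who answer an item correctly. The sketch below uses the standard binomial approximation for the standard error of a proportion. This formula is an assumption brought in for illustration and is not part of a typical test report, and the function name and the 70% figure are invented.

```python
import math


def proportion_standard_error(p, n):
    """Approximate standard error of a proportion p observed in n test takers."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1 - p) / n)


# Suppose 70% of students answer an item correctly (illustrative value).
for n in (25, 50, 100):
    se = proportion_standard_error(0.70, n)
    print(f"n={n:3d}  standard error = {se:.3f}")
```

Under this approximation, quadrupling the group from 25 to 100 students halves the sampling error, which is why the same item can look quite different from one small class to the next.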

The first step in test analysis is to review the report to make sure that it is complete. Check the number of items and the number of examinees, and verify that both are accurate. This sample has 100 items, so each raw score equals the percentage correct, and the answer sheets of 92 examinees were scored. Once you verify that these figures are correct, you are ready to analyze the results of the test.
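If you have the raw scores in electronic form, the completeness check can be sketched in a few lines of Python. The function names `verify_report` and `percent_correct` and the three sample scores are hypothetical; a real report would list all 92 examinees.

```python
def verify_report(raw_scores, n_items, expected_examinees):
    """Return a list of problems found; an empty list means the counts check out."""
    problems = []
    if len(raw_scores) != expected_examinees:
        problems.append(
            f"expected {expected_examinees} examinees, report has {len(raw_scores)}"
        )
    # A raw score can never be negative or exceed the number of items.
    out_of_range = [s for s in raw_scores if not 0 <= s <= n_items]
    if out_of_range:
        problems.append(f"{len(out_of_range)} raw score(s) outside 0..{n_items}")
    return problems


def percent_correct(raw, n_items):
    """Convert a raw score to the percentage of items answered correctly."""
    return 100 * raw / n_items


scores = [84, 91, 77]  # invented raw scores for illustration
print(verify_report(scores, n_items=100, expected_examinees=3))  # []
print(percent_correct(84, 100))  # 84.0
print(percent_correct(42, 50))   # 84.0
```

On a 100-item test the raw score and the percentage correct are the same number, as the last two lines show; on a 50-item test a raw score of 42 is needed to reach the same 84%.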

Measures of Central Tendency

Measures of central tendency provide a single value that best represents the typical score in a distribution. The mean, the median, and the mode are the three most commonly used measures of central tendency in education. While the mode, which represents the most frequently obtained score in a distribution, has limited usefulness for interpreting classroom test scores, both the mean and the median provide valuable information.
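The three measures can be computed with Python's standard `statistics` module, as in the short sketch below. The seven scores are invented for illustration and are not taken from Table 11.1.

```python
import statistics

# Invented raw scores for a small group (illustration only).
scores = [72, 78, 80, 80, 85, 88, 91]

mean_score = statistics.mean(scores)       # arithmetic average
median_score = statistics.median(scores)   # middle score when sorted
modes = statistics.multimode(scores)       # most frequent score(s)

print(mean_score)    # 82
print(median_score)  # 80
print(modes)         # [80]
```

`multimode` is used rather than `mode` because a score distribution can have more than one most frequent value.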

Measures of Variability

Because it is impossible to predict the range of scores for a test based on measures of central tendency alone, it is necessary for us to look further when interpreting a set of scores. In fact, two sets of scores could have the same mean and have a very different spread of scores. We need to examine measures of variability to determine how much the scores spread out from the mean or how much dispersion there is in a distribution.
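The following Python sketch illustrates the point with two invented sets of scores that share a mean of 80 but spread out very differently. The class names and scores are hypothetical, and the sketch assumes the population form of the standard deviation (`statistics.pstdev`); your testing software may report the sample form instead.

```python
import statistics

# Two invented score sets with the same mean (illustration only).
class_a = [78, 79, 80, 81, 82]
class_b = [60, 70, 80, 90, 100]

for name, scores in (("Class A", class_a), ("Class B", class_b)):
    mean = statistics.mean(scores)
    score_range = max(scores) - min(scores)   # highest minus lowest score
    sd = statistics.pstdev(scores)            # population standard deviation
    print(f"{name}: mean={mean}, range={score_range}, SD={sd:.2f}")
```

Both classes average 80, yet the range is 4 points in Class A and 40 points in Class B, so the mean alone would hide a tenfold difference in spread.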

Range

Standard Deviation

Reliability Coefficient

Standard Error of Measurement (SEM)