## Standardized Testing Accuracy and Precision

### ( 11/07/08 )

### Introduction

Both state and federal (NCLB) regulations have placed great emphasis on standardized testing as a measure of the quality of schools and teachers. Many schools have adopted standardized testing to guide their improvement plans. But are the tests reliable? Are the tests providing meaningful information?

We understand from math, science, and technology that all measures must be checked for accuracy and precision. Both accuracy and precision must be high for a measure to be useful. In this study, we review problems with both the accuracy and precision of MAP testing as a guide for improving education. We will demonstrate that MAP testing does not provide data that is sufficiently accurate and precise to guide differentiation in the classroom.
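The distinction between accuracy and precision can be made concrete with a small sketch. The two measurement sets below are invented purely to illustrate the two failure modes; they are not test data:

```python
import statistics

# Two hypothetical sets of repeated measurements of a quantity
# whose true value is 100.
accurate_but_imprecise = [88, 112, 95, 107, 91, 109, 97, 101]
precise_but_inaccurate = [109, 110, 111, 110, 109, 111, 110, 110]

for name, data in [("accurate but imprecise", accurate_but_imprecise),
                   ("precise but inaccurate", precise_but_inaccurate)]:
    bias = statistics.mean(data) - 100   # accuracy: how far the average sits from the truth
    spread = statistics.stdev(data)      # precision: scatter among repeated measures
    print(f"{name}: bias = {bias:+.1f}, spread = {spread:.1f}")
```

A useful measure needs both numbers to be small. A test can be perfectly consistent (precise) and still systematically miss what it claims to measure, and vice versa.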

It will be left to the readers to generalize the problems reviewed in this study to other standardized tests.


### Part 1: Philosophical Errors Intrinsic to MAP Testing

Any measure based on philosophical errors or scientifically inaccurate assumptions will produce erroneous results. MAP testing has intrinsic to its design at least two such errors.

• Linear Learning Hypothesis & the RIT Scale: The MAP test is sold on the linear learning hypothesis. This idea holds that all students learn material in the same order, so learning can be divided into levels that are equal and identical for all students and can be measured on the RIT scale. This hypothesis is known to be false. The assumption is so blatantly false that anyone who has worked in education for more than a year should recognize the severity of the error from experience alone. For example, I have had students who were very strong in algebra who could not add fractions or perform long division if their lives depended on those lower skills. The low-level skills had minimal bearing on their higher-level successes.
• High Achievement vs. Accelerated Learning: Standardized testing fails to distinguish between high achievement and accelerated learning. Testing purports to determine the knowledge and skill levels of my students, but it cannot tell me whether my students can achieve at high levels. For example, this very page was originally created for a person who had received high scores in an educational statistics class but still could not evaluate the reliability of a standardized test from either data or statistical reports. This contrasted sharply with my middle school students who, using just basic chart and graph skills, could evaluate the precision problems in tests. The former demonstrates accelerated skills with low comprehension; the latter demonstrates high achievement with low skills. MAP testing focuses on basic skills but does not, and cannot, focus on high-level achievement.


> **Cognition Notes:** These notes contrast the levels of cognition involved in the related concepts. They support the problem being discussed by showing that understanding test data requires high cognitive levels but low knowledge levels.
>
> **Philosophical Errors.** *Knowledge level:* multiple dimensions (5th grade: map reading; 7th grade: drawing; 12th grade physics: vectors). *Cognitive level:* synthesis of diverse examples of multiple dimensions; evaluation of presented concepts; drawing from experience and training.

### Part 2: Accuracy Problems

Accuracy is about hitting the mark you intend to hit. In testing, that translates to, "Did you actually measure what you intended to measure?" If your intention is to measure success in school, then you need to be sure you know how to test the characteristics of academic or cognitive success. Here is a list of significant areas where all standardized testing, including MAP, fails to provide accurate information.

• Bloom's Taxonomy of cognitive levels: High level success involves the ability to integrate diverse information, and the ability to evaluate complex information. These two skills are nearly impossible to measure with multiple choice tests. Standardized tests focus almost entirely on knowledge and skills.
• Inquiry-based Learning: A real measure of academic success is the ability to ask big questions, then seek out understanding and knowledge that may answer those questions. This is how scientists, journalists, and engineers work. Since standardized tests focus on low level knowledge and skills, they never successfully test the ability to ask, and seek answers to, big questions. Real work usually starts with the big questions then seeks out the details. Test-based learning starts with the details and rarely ever finds the big picture.
• Reasoning, Problem Solving, and Communication: Real success requires a person to reason through, and solve, non-routine problems, then to communicate the validity of the reasoning. Real problems are so complex that they may take hours to weeks to solve; this realization is summed up in the NCTM Standards. Standardized testing, however, focuses on simple concepts where each question can be answered in less than two minutes.
• Expeditionary Education: For people to grow in skills, they must understand themselves. For those skills to be useful, they must integrate across a large cognitive spectrum and produce a final product that serves a real need. This is the underlying concept of the ELS Design Principles. Standardized tests separate skills into discrete units and discourage educators from integrating those skills into real projects.
• Self-awareness, Marzano's Taxonomy, & Holistic Education: Within just two years of teaching, educators will observe that most of the barriers that impede student learning are emotions, attitudes, or social issues. Even students who have missed many basic skills will perform at high levels if they are given a situation that supports the development of a good attitude, and good habits, for learning. Developing self-awareness then becomes the key element to learning. Standardized tests totally ignore these aspects of learning, even though these may be the most important factors.
• Needs of High Achievers vs. Needs of Low Achievers: Most teachers in test-driven environments teach to the lowest third of their class, and most standardized tests, including MAP, are designed around the needs of the lowest third of students. MAP testing has been praised for its ability to raise the test scores of the lowest-performing students, but not for its ability to support the needs of high-performing students. This is because the achievement needs of high performers are structurally different from the knowledge and skill needs of low performers. Testing thus tends to discourage schools from supporting the needs of high achievers; below, we present strong evidence that MAP testing actually does so.

In this list, we see that most measures and motivators of real academic success are either glossed over or totally ignored by standardized testing. A major cause of this problem is that real success cannot be measured by multiple-choice tests. An institution that depends too strongly on standardized tests can produce test scores that imply success, but those scores will not measure real achievement.

> **Cognition Notes — Accuracy.** *Knowledge level:* 7th grade science: accuracy vs. precision; college: cognition and education training for educators. *Cognitive level:* application of cognition concepts; synthesis and application: accuracy concepts applied to a salesman's claims; evaluation: needs compared and contrasted to the product (the MAP test).

### Part 3: A Review of Precision Problems

#### A: Our In-House Growth Data: General Scores
We tested our students twice over half of a school year. For each test, we created growth graphs for each grade level. Precision problems were immediately apparent in every single test. Many students declined more than half a year's normal growth; many students gained more than a year's growth. Overall, every single test showed between 20% and 50% of the growth data falling outside a reasonable range. To see this problem, we must ask, "Is it reasonable to believe that students declined more than half a year, or gained more than a year, in just half a year?" Such change would be highly unlikely, particularly with changes in both directions occurring in each classroom. This result strongly indicates that MAP testing failed to measure student performance to within an entire school year's growth. Can data with a full year's imprecision actually be used to guide teachers? This graph shows the most precise results we got from MAP, in the classroom of an expert teacher. Yet 10 of the 38 growth scores (26%) fall outside a reasonable range. Further, the scores did not match the student performance that the teacher was observing in the classroom.
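The screening we applied can be sketched in a few lines. The growth scores below are hypothetical stand-ins, and the "reasonable range" assumes a normal half-year's growth of about 4 RIT points:

```python
# Hypothetical fall-to-winter growth scores (RIT points) for one class.
growth_scores = [5, -6, 3, 14, 2, 0, -9, 4, 7, 13, 1, 3, -5, 6, 2, 8, -1, 4, 15, 3]

# "Reasonable" growth over half a year: no worse than a half-year decline
# (-4 points) and no better than a full extra year's gain (+12 points).
low, high = -4, 12

outliers = [g for g in growth_scores if not (low <= g <= high)]
share = len(outliers) / len(growth_scores)
print(f"{len(outliers)} of {len(growth_scores)} growth scores ({share:.0%}) "
      "fall outside a reasonable range")
```

With these invented numbers the check flags 6 of 20 scores (30%), squarely inside the 20% to 50% range we observed in our actual data.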
> **Cognition Notes — Reading Growth Graphs.** *Knowledge level:* 6th grade: graph reading; 7th grade science: precision. *Cognitive level:* application: precision concepts applied to data; synthesis and evaluation: using awareness of imprecision to evaluate the reliability of data from the presented graphs.

#### B: Our In-House Growth Data: Strand Scores
For the tests to provide useful information, we need specific details: teachers need to know which skills should be addressed with each student. MAP provides rough estimates of this information in its strand data. But when we look at the strand data for a test, fully 48% of the growth scores fall outside a reasonable range. The precision was too low to discern what specific support our students really needed.
> **Cognition Notes — Scatter Plots.** *Knowledge level:* 11th grade science: scatter plots of data; college statistics: knowledge of r-values. *Cognitive level:* application and evaluation: reasoning about the reliability of data from the spread of the data.

#### C: MAP vs. EOG
Another means of checking the reliability of a test is to compare its results to another test. This is particularly important if the goal is to raise scores on that other test, as is the case for the high-stakes testing mandated by NCLB. So we compared the results of a MAP test to the results of the End of Grade (EOG) test required by our state. Again, reliability was low. MAP ranked about 5 students a full year higher in performance, and 4 students a full year lower (23% of students in total), than the EOG. An error of a full year's learning is quite significant; when guiding instruction, even a half-year error is significant. What good does it do a teacher to be told a student is doing fine in chapter 8 when he still needs serious help with everything since chapter 3? All of our in-house data checks showed that MAP gives misleading results 20% to 50% of the time, making it unreliable for guiding differentiation. Was this our problem, or a problem intrinsic to MAP itself? Sections D, E, and F will show that the problem is with the test, not with our school or our students.
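The cross-check against a second test can be sketched the same way. The grade-equivalent placements below are hypothetical illustrations, not our actual student data:

```python
# Hypothetical grade-equivalent placements for the same ten students
# on two different tests.
map_ge = [6.2, 5.8, 7.4, 6.0, 5.1, 6.8, 7.9, 5.5, 6.3, 4.9]
eog_ge = [6.1, 6.9, 6.6, 6.1, 5.0, 6.5, 7.8, 5.7, 6.4, 6.0]

# Flag students whom the two tests place a full year or more apart.
disagreements = [abs(m - e) >= 1.0 for m, e in zip(map_ge, eog_ge)]
print(f"{sum(disagreements)} of {len(disagreements)} students placed "
      f"a year or more apart ({sum(disagreements) / len(disagreements):.0%})")
```

With these invented placements, 2 of 10 students (20%) are ranked a year or more apart, roughly the level of disagreement we found in practice.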
> **Cognition Notes — Plots and Charts.** *Knowledge level:* 8th grade: graph reading; 11th grade: scatter plots of data. *Cognitive level:* synthesis: applying information from related manuals; evaluation: reasoning about the spread of the data.

#### D: Reliability: Standard Deviation
NWEA's technical manual tells us that the best precision for midrange scores will be about 3 points, while the precision for extreme scores approaches 8 points. What does this mean? If the standard deviation is 3 points, we expect about 68% of the data to fall within 3 points of the score, and 95% to fall within 6 points. But what precision do we need? If we are using testing to guide instruction, ideally about 95% of scores should be reliable to within half a year's growth; imprecision greater than this does not give teachers enough information to guide instruction.

So how many points is a typical year's growth? That we can find in NWEA's RIT Scale Norms manual. For each of the three tests, expected growth is highest for third grade and drops with each successive grade. For reading, mean growth is 14.4 points for 3rd grade but just 2.4 for 10th. For language, it is 9.28 points for 3rd but just 2.0 for 10th. For math, it is 15.1 points for 3rd but just 3.8 for 10th.

In this graph, we can see that by 4th grade none of the three tests can provide 95% of the data precise to within half a year's normal learning. By 6th grade, only the math test can provide 95% of the data precise to within a full year's normal learning. In real terms, this is like telling teachers, "The student's score may mean he will struggle with pre-algebra. It may just as well mean he will find algebra easy." Can a teacher really be expected to make wise decisions about instruction using data that is imprecise by a year or more?
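Under a normal error model, the two manuals' figures can be combined directly: the share of scores landing within a margin of the true score is erf(margin / (SEM·√2)). A sketch using the best-case 3-point error and the growth norms quoted above:

```python
from math import erf, sqrt

def frac_within(margin, sem):
    """P(|measurement error| < margin) for normally distributed error with SD = sem."""
    return erf(margin / (sem * sqrt(2)))

sem = 3.0  # best-case standard error for midrange scores, per the technical manual

# Mean yearly growth in RIT points, as quoted above from the RIT Scale Norms.
yearly_growth = {"reading 3rd": 14.4, "reading 10th": 2.4,
                 "math 3rd": 15.1, "math 10th": 3.8}

for label, points in yearly_growth.items():
    half_year = points / 2
    print(f"{label}: {frac_within(half_year, sem):.0%} of scores "
          "within half a year's growth")
```

Even with the best-case 3-point error, the 10th-grade figures come nowhere near the 95% target: only about 31% of reading scores and 47% of math scores land within half a year's growth of the true value.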


> **Cognition Notes — Graphical Comparisons.** *Knowledge level:* 4th grade: division; 7th grade: graph reading and creation; college statistics: standard deviation. *Cognitive level:* synthesis: applying information from manuals; evaluation: reasoning about the implications of reliability.

#### E: Reliability: r-values
NWEA's technical manual uses r-values to estimate the reliability of the tests. The reported r-values range from 0.76 to 0.93, with most falling between 0.80 and 0.89. But what does this mean for those wishing to use the tests to guide instruction? One can simulate the r-values to estimate what percentage of scores will lie within a reliable range. An r-value of 0.92 can easily mean that over 25% of the data lies more than 6 points away from the true score, yet 6 points constitutes a year's normal growth for over half of the tests. As simulated in the graph above, even with an r-value of 0.92, 29% of the data is in error by more than a year's normal growth. This imprecision leads to serious errors in identifying student needs, and then to tracking students incorrectly.
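The simulation idea can be reproduced in a few lines. Classical test theory gives SEM = SD·√(1 − r), and a growth score subtracts two noisy measurements, so both sittings contribute error. The per-grade score SD of 15 RIT points used below is our assumption for illustration, not a figure from the manual:

```python
import random

def growth_error_rate(r, sd=15.0, threshold=6.0, n=100_000):
    """Fraction of simulated growth scores whose error exceeds `threshold` points.

    SEM = sd * sqrt(1 - r) is the classical test theory relation; a growth
    score is the difference of two observed scores, so it carries the
    measurement error of both test sittings.
    """
    sem = sd * (1 - r) ** 0.5
    errors = (random.gauss(0, sem) - random.gauss(0, sem) for _ in range(n))
    return sum(abs(e) > threshold for e in errors) / n

random.seed(0)
print(f"{growth_error_rate(0.92):.0%} of growth scores off by more than 6 points")
```

With these assumptions the rate comes out near one third, consistent with the 29% figure quoted above; the exact value depends on the assumed score SD.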
> **Cognition Notes — Graphs with r-values.** *Knowledge level:* 11th grade: graph reading; college statistics. *Cognitive level:* synthesis of divergent discussions of r-values.

#### F: Negative Expected Growth for High Achievers
One of the most disturbing aspects of the precision and accuracy problems of MAP testing is that NWEA's own data clearly shows that negative growth is normal for high achievers. In this expected growth graph, we can see that growth can be precisely measured only for the lowest-performing students; decline is the norm for high-scoring students. This strongly suggests that MAP testing actually promotes instructional methods that do more harm than good for high-achieving students, and it should discourage the use of MAP testing for all above-average students.

### Part 4: Cost-Benefit Analysis

An important consideration in any program is whether the benefits justify the costs. To evaluate this, one must be able to list both the costs and the benefits. What did MAP testing draw from our little school's tight budget?

Costs:

• $3,000 per year to test approximately 240 students.
• 6 weeks of schedule disruptions for both teachers and students interfering with projects and planning time.
• Many hours taken from every teacher's planning and professional development time to train teachers to read test results and to plan based on scores.
• Hours taken from classroom teachers to deal with parent complaints about dropping scores. In almost every case, low testing precision was the cause.
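The per-student arithmetic on the first line item is worth making explicit:

```python
# Direct testing fee, from the first cost item above.
annual_fee = 3000   # dollars per year
students = 240
per_student = annual_fee / students
print(f"${per_student:.2f} per student per year")  # $12.50
```

The cash fee works out to just $12.50 per student per year, which underlines that the time costs listed above, not the fee itself, dominate the budget impact.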

Risks:

• Parent anger towards the school and teachers because of stagnant or dropping test scores. The test gives high achievers dropping scores more often than low achievers, yet the higher-performing the student, the more likely parents are to be angered by dropping scores.
• Incorrectly tracking students due to significant errors caused by low testing precision and low testing accuracy.
• Shifting emphasis of education from real accomplishments and high cognition, to test preparation and low level skills. This may result in lower interest and performance for all students, especially high achievers.
• Emphasis, resources, and focus drawn away from average and high-achieving students in order to raise the test scores of low-scoring students.
• Distraction from the real needs of students, including study skills, hearing and vision problems, emotional and social problems, and other factors that significantly affect performance.
• Self-fulfilling low expectations for low scoring students.

The costs and risks of MAP Testing are very high, especially for the high achieving students. In just two years of testing, most of the risks identified here occurred within our school. Considering that the costs and risks of testing are so high, what were the identifiable benefits?

Benefits:

• Teachers received help determining the knowledge and skill levels of the lowest-performing students.

We were not able to identify any other clear benefit of testing. In fact, the only success story other schools told us about MAP testing was that it helped teachers identify the academic needs of their lowest-performing students. For a school whose population is average or above, the costs and risks of MAP testing do not justify this benefit.

> **Cognition Notes — Cost-Benefit Comparisons.** *Knowledge level:* 5th grade: addition, subtraction, and division. *Cognitive level:* synthesis and evaluation: collect and organize information; evaluate its implications.

> **Cognition Notes — Summation.** Like most real-life situations, this problem requires mostly low-level knowledge to understand. Critical thinking applied to low-level knowledge is what most situations demand. Yet standardized testing reinforces curriculum decisions that promote accumulating knowledge instead of critical thinking.

### Generalizing

We strongly suspect that most of the problems discussed here apply to all standardized tests; MAP was simply the only test we had the time and resources to evaluate. We encourage others to perform similar evaluations on whatever tests they are using. Please let us know your results.