Standardized Testing Accuracy and Precision
An Evaluation of NWEA's MAP Testing
(11/07/08)
Both state and federal (NCLB) regulations have placed great emphasis
on standardized testing as a measure of the quality of schools and teachers.
Many schools have adopted standardized testing to guide their improvement
plans. But are the tests reliable? Are the tests providing meaningful data?
We understand from math, science, and technology that all measures must
be checked for accuracy and precision. Both accuracy and precision must
be high for a measure to be useful. In this study, we review problems
with both the accuracy and precision of MAP testing as a guide for improving
education. We will demonstrate that MAP testing does not provide data
that is sufficiently accurate and precise to guide differentiation in the classroom.
It will be left to the readers to generalize the problems reviewed in
this study to other standardized tests.
Side Notes: Supporting details
Test related research:
how to choose a test
Part 1: Introduction: Philosophical Errors Intrinsic
to MAP Testing
Any measure based on philosophical errors or scientifically inaccurate
assumptions will produce erroneous results. MAP testing has intrinsic
to its design at least two such errors.
- Linear Learning Hypothesis & the RIT Scale: The MAP test
is sold on the idea of the linear learning hypothesis. This idea suggests
that all students learn material in the same order, and thus that learning may be defined as levels which are equal for all students and can be measured by the RIT scale. This hypothesis is known to be
false. The assumption is so blatantly false that anyone who has worked
in education for more than a year should recognize the severity of the
error from experience alone. For example, I have had students who were
very strong in algebra who could not add fractions or perform long division
if their lives depended on those lower skills. The low level skills
had minimal bearing on their higher level successes.
- High Achievement vs. Accelerated Learning: Standardized testing
fails to distinguish between high achievement and accelerated learning.
Testing purports to determine the knowledge and skill levels of my students.
But it cannot tell me whether my students can achieve at high levels.
For example, this very page was originally created for a person who
had received high scores in an educational statistics class, but still
could not evaluate the reliability of a standardized test from either
data or statistical reports. This contrasted sharply with my middle school
students who, using just basic chart and graph skills, could evaluate
the precision problems in tests. The former demonstrates accelerated
skills with low comprehension; the latter demonstrates high achievement
with low skills. MAP testing focuses on basic skills, but does not, and cannot, focus on high level achievement.
Cognition Notes: These notes contrast the various levels of cognition for the related concepts. They support the problem being discussed by showing that understanding test data requires high cognitive levels, but low knowledge levels.
Knowledge Level: multiple dimensions
- 5th grade map reading
- 7th grade drawing
- 12th grade physics - vectors
- Synthesis of diverse examples of multiple dimensions
- Evaluation of presented concepts, drawing from experience
Part 2: Accuracy Problems
Accuracy is about hitting the mark that you intend to hit. In testing,
that translates to, "Did you actually measure what you intended
to measure?" If your intention is to measure success in school, then
you need to be sure you know how to test characteristics of academic or
cognitive success. Here's a listing of significant places where all standardized
testing, including MAP, fails to provide accurate information.
- Bloom's Taxonomy of cognitive levels: High level success involves
the ability to integrate diverse information, and the ability to evaluate
complex information. These two skills are nearly impossible to measure
with multiple choice tests. Standardized tests focus almost entirely
on knowledge and skills.
- Inquiry-based Learning: A real measure of academic success
is the ability to ask big questions, then seek out understanding and
knowledge that may answer those questions. This is how scientists, journalists,
and engineers work. Since standardized tests focus on low level knowledge
and skills, they never successfully test the ability to ask, and seek
answers to, big questions. Real work usually starts with the big questions
then seeks out the details. Test-based learning starts with the details
and rarely ever finds the big picture.
- Reasoning, Problem Solving and Communication: Real success
requires a person to reason through, and solve, non-routine problems,
then to communicate the validity of the reasoning. Real problems are
so complex that they may take hours to weeks to solve. This realization
is summed up in the NCTM Standards. However, standardized testing focuses
on simple concepts where each question may be answered in less than
two minutes. Real cognitive success involves creatively finding solutions
to problems that are so complex that a person may take days to solve them.
- Expeditionary Education: For a person to grow in skills, they
must understand themselves. For those skills to be useful, they must integrate across a large cognitive spectrum and contribute to a final product that serves a real need. This is the underlying concept of the ELS Design
Principles. Standardized tests separate skills into discrete units and
discourage educators from integrating those skills into real projects.
- Self-awareness, Marzano's Taxonomy, & Holistic Education:
Within just two years of teaching, educators will observe that most
of the barriers that impede student learning are emotions, attitudes,
or social issues. Even students who have missed many basic skills will
perform at high levels if they are given a situation that supports the
development of a good attitude, and good habits, for learning. Developing
self-awareness then becomes the key element to learning. Standardized
tests totally ignore these aspects of learning, even though these may
be the most important factors.
- Needs of High Achievers vs. Needs of Low Achievers: Most teachers
in test driven environments teach to the lowest third of their class.
Most standardized tests, including MAP, are designed around the needs
of the lowest third of the students. MAP testing has been praised for
its ability to raise test scores of the lowest performing students.
However, it is not praised for its ability to support the needs of high
performing students. This results from the achievement needs of the
high performers being structurally different from the knowledge and
skill needs of the low performers. Testing tends to discourage schools
from supporting the needs of the high achievers. Below, we will show
strong evidence that MAP testing actually discourages schools from supporting
the needs of high achievers.
In this list, we see that most measures and motivators of real academic
success are either glossed over, or totally ignored, by standardized testing.
A major cause of this problem is that real success cannot be measured
by multiple choice tests. An institution that depends too strongly on
standardized tests can produce test scores that imply success. But those
scores will not measure real achievement.
- 7th grade science: accuracy vs. precision
- college: cognition and education training for educators
- Application of cognition concepts
- Syntheses and application: accuracy concepts applied to claims
- Evaluation: needs compared and contrasted to product (MAP Test)
Part 3: A review of Precision Problems
A: Our in-house growth data: General Scores
We tested our students twice, half a school year apart. For each test, we created growth graphs for each grade level.
Precision problems were immediately apparent in every single test.
Many students declined more than half a year's normal growth. Many
students gained more than a year's growth. Overall, every single
test showed between 20% and 50% of the growth data falling outside
a reasonable range.
To see this problem we must ask, "Is it reasonable to believe
that students declined more than half a year, or gained more than
1 year, in just half a year?" Such change would be highly unlikely, particularly if changes in both directions occurred in each classroom. This result strongly indicates that MAP testing failed to measure student performance to within an entire school year's growth. Can data with an entire year's worth of imprecision actually be used to guide teachers?
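The check itself is simple enough to sketch in a few lines of code. The growth values below are hypothetical stand-ins, as is the 6 RIT points assumed for a year's normal growth; only the banding logic, flagging any half-year growth score below minus half a year or above a full year, comes from the check described above.

```python
# Sketch: flag half-year growth scores outside a plausible band.
# The scores and the yearly-growth norm are hypothetical stand-ins,
# not our actual classroom data.

YEAR_GROWTH = 6.0    # assumed RIT points for one year's normal growth
LOW, HIGH = -0.5 * YEAR_GROWTH, 1.0 * YEAR_GROWTH   # reasonable band

growth_scores = [3, -5, 2, 9, 1, -4, 7, 0, 12, 2]   # hypothetical RIT changes

outliers = [g for g in growth_scores if not LOW <= g <= HIGH]
pct = 100 * len(outliers) / len(growth_scores)
print(f"{len(outliers)} of {len(growth_scores)} scores "
      f"({pct:.0f}%) fall outside the reasonable range")
```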
This graph shows the most precise results we got from MAP, in the classroom of an expert teacher. Yet 10 of the 38 growth scores (26%) fall outside a reasonable range. Further, scores did not match
the student performance that the teacher was observing in the classroom.
What would a precise growth measure look like?
Reading Growth Graphs:
- 6th grade: graph reading
- 7th grade science: precision
- Application: precision concepts applied to data
- Synthesis & evaluation: using awareness of imprecision to
evaluate reliability of data from presented graphs
B: Our in-house growth data: strand scores
For the tests to provide useful information, we need specific details.
Teachers need to know specifically what skills should be addressed
with each student. MAP provides rough estimates of this information
with the strand data. But when we look at the strand data for a test,
fully 48% of the growth scores fall outside of a reasonable range.
The precision was too low to discern what specific support our students needed.
What graph would precise strand scores produce?
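The same banding check can be applied strand by strand. Here is a sketch, again with hypothetical values and the same assumed 6-point yearly growth:

```python
# Sketch: percentage of strand growth scores outside a reasonable
# band, broken out by strand. All values are hypothetical stand-ins.

YEAR_GROWTH = 6.0
LOW, HIGH = -0.5 * YEAR_GROWTH, YEAR_GROWTH

strand_growth = {   # hypothetical half-year growth per strand
    "number sense": [2, -7, 3, 1, 8, 4],
    "geometry":     [4, 9, -6, 2, 5, -1],
    "measurement":  [1, 3, -8, 10, 2, 12],
}

for strand, scores in strand_growth.items():
    bad = sum(1 for g in scores if not LOW <= g <= HIGH)
    print(f"{strand:12s}: {bad}/{len(scores)} "
          f"({100 * bad / len(scores):.0f}%) outside the range")
```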
- 11th grade science: scatter plots of data
- college statistics: knowledge of r-values
- Application & Evaluation: reasoning about the reliability of data from its spread
C: MAP vs. EOG
Another means of checking the reliability of a test is to compare
its results to another test. This is particularly important if the
goal is to increase the scores on the other test. Such would be
the case for high stakes testing mandated by NCLB.
So we compared the results of a MAP test to the results of an End
of Grade Test (EOG) required by our state. Again, reliability was low. MAP ranked about 5 students a full year higher in performance, and 4 students a full year lower, than the EOG (23% in total). An error of a full year's learning is quite significant. When guiding instruction, an error of even half a year is significant. What good does it do a teacher to be told a student is doing fine in chapter 8, when he still needs serious help with everything since chapter 3?
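As a sketch of this cross-check, the snippet below counts students whose two tests place them a full year or more apart. It assumes each score has already been converted to a grade-equivalent; all paired values are hypothetical.

```python
# Sketch: count students whose MAP and EOG grade-equivalent
# placements disagree by a full year or more. Values hypothetical.

pairs = [  # (MAP grade equivalent, EOG grade equivalent) per student
    (7.2, 6.1), (6.8, 6.9), (5.9, 7.0), (6.5, 6.4),
    (7.8, 6.6), (6.1, 6.2), (5.5, 6.6), (6.9, 6.8),
]

map_high = sum(1 for m, e in pairs if m - e >= 1.0)
map_low  = sum(1 for m, e in pairs if e - m >= 1.0)
total = map_high + map_low
print(f"MAP a year high: {map_high}, a year low: {map_low} "
      f"({100 * total / len(pairs):.0f}% of {len(pairs)} students)")
```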
All of our in-house data checks showed us that MAP will give misleading results 20% to 50% of the time, making it too unreliable to guide differentiation.
Was this our problem, or a problem intrinsic to MAP itself? Sections
D, E, and F will show that the problem is with the test, not our
school or our students.
Plots and charts:
- 8th grade: graph reading
- 11th grade: scatter plots for data
- Synthesis: apply information from related manuals
- Evaluation: reason about the spread of data
D: Reliability: Standard Deviation
NWEA's technical manual tells us that the best precision
for midrange scores will be about 3 points, while the precision for
extreme scores will approach 8 points. But what does this mean?
If the standard deviation is 3 points, we expect about
68% of the data to fall within 3 points, and 95% of the data to fall
within 6 points of the score. But what precision do we need? If we
are using testing to guide instruction, ideally about 95% of scores
need to be reliable to within half a year's growth. Imprecision greater
than this does not give teachers sufficient information to guide instruction.
So how many points is a typical year's growth? That
we can find in NWEA's RIT Scale Norms manual. For each of the three
tests, expected growth scores are highest for third grade and drop
with each successive grade. For reading, mean growth is 14.4 points
for 3rd, but just 2.4 for 10th. For language it's 9.28 points for 3rd, but just 2.0 for 10th. For math it's 15.1 points for 3rd, but just
3.8 for 10th.
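To make the norms concrete, here is a minimal sketch that checks whether a 95% band (roughly ±2 standard deviations, or ±6 points at the manual's best-case 3-point precision) fits within half a year's mean growth. Only the 3rd and 10th grade figures quoted above are included; the grades in between follow the same pattern.

```python
# Sketch: does a 95% band (about +/-2 SD) fit within half a year's
# normal growth? Uses the best-case 3-point precision and the
# 3rd/10th grade mean-growth figures quoted above.

SD = 3.0                  # best-case precision from the technical manual
BAND_95 = 2 * SD          # ~95% of scores fall within +/-2 SD

mean_growth = {           # RIT points per year, from the norms manual
    ("reading",  3): 14.4, ("reading",  10): 2.4,
    ("language", 3): 9.28, ("language", 10): 2.0,
    ("math",     3): 15.1, ("math",     10): 3.8,
}

for (test, grade), growth in mean_growth.items():
    ok = BAND_95 <= growth / 2   # band must fit in half a year's growth
    print(f"{test:8s} grade {grade:2d}: half-year growth = "
          f"{growth / 2:4.1f} points; precise enough? "
          f"{'yes' if ok else 'no'}")
```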
In this graph, we can see that by 4th grade all three
tests are unable to provide 95% of the data precise to within half
a year's normal learning. By 6th grade only the math test is able
to provide 95% of the data precise to within a year's normal learning.
In real terms this is like telling teachers, "The student's score
may mean the student will struggle with pre-algebra. His score may
just as well mean that he will find algebra easy." Can a teacher
really be expected to make wise decisions about instruction using
data that is imprecise to a year or more?
more detail about standard deviation and test reliability
- 4th grade division
- 7th grade graph reading and creation
- college statistics: standard deviation
- Synthesis: applying information from manuals
- Evaluation: reasoning about the implications of reliability
E: Reliability: r-values
NWEA's technical manual uses r-values to estimate the
reliability of the tests. The r-values they report range from 0.76
to 0.93 with most of the values being between 0.80 and 0.89. But what
does this mean for those wishing to use the tests to guide instruction?
One can run simulations to estimate what percentage of scores will lie within a reliable range for a given r-value. An r-value of 0.92 could easily mean that over 25% of the data lies more than 6 points away from the true score. Yet 6 points constitutes a year's normal growth for over half of the tests.
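Below is a minimal Monte Carlo sketch of such a simulation, using classical test theory, where the standard error of measurement is σ√(1−r). The 15-point population standard deviation is our assumption for illustration, not a figure from NWEA's manuals, so the exact percentage will move with it; with these inputs, roughly 30% of half-year growth scores land more than 6 points from the truth.

```python
import math
import random

# Monte Carlo sketch: for a test with reliability r, how often does
# a growth score (spring minus fall) miss the true growth by more
# than a year's normal learning? The 15-point population SD is an
# illustrative assumption, not a figure from NWEA's manuals.

R = 0.92                         # reported r-value (reliability)
SIGMA = 15.0                     # assumed population SD of RIT scores
SEM = SIGMA * math.sqrt(1 - R)   # standard error of measurement
YEAR_GROWTH = 6.0                # one year's normal growth, RIT points
TRIALS = 100_000

random.seed(1)
misses = 0
for _ in range(TRIALS):
    # each sitting contributes its own independent measurement error
    fall_error = random.gauss(0, SEM)
    spring_error = random.gauss(0, SEM)
    if abs(spring_error - fall_error) > YEAR_GROWTH:
        misses += 1

print(f"SEM = {SEM:.1f} points; {100 * misses / TRIALS:.0f}% of "
      f"growth scores are off by more than {YEAR_GROWTH:.0f} points")
```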
As simulated in the graph above, even with an r-value of 0.92, 29% of the data is in error by more than a year's normal growth. This imprecision would lead to serious errors in identifying student needs, and then to tracking students incorrectly.
Graphs with r-values:
- 11th grade graph reading
- college statistics
- Synthesis of divergent discussions of r-values
F: Negative Expected Growth for High Achievers
One of the most disturbing aspects of the precision
and accuracy problems of MAP testing is that NWEA's data clearly shows
that negative growth is normal for high achievers.
In this expected growth graph, we can
see that growth can only be precisely measured for the lowest performing
students. Decline is the norm for high scoring students.
This strongly suggests that using MAP testing actually
promotes instructional methods that do more harm than good for high
achieving students. This result should discourage the use of MAP testing
for all above average students.
more detail about negative expected growth
Part 4: Cost-Benefit Analysis
An important consideration in any program is whether the benefits justify
the costs. To evaluate this, one must be able to list both the costs and
the benefits. What did MAP testing draw from our little school's tight budget?
- $3000 per year to test approximately 240 students (about $12.50 per student).
- 6 weeks of schedule disruptions for both teachers and students, interfering with projects and planning time.
- Many hours taken from every teacher's planning and professional development time for training on reading test results and on planning from test scores.
- Hours demanded of both the network administrator and school administrators.
- Hours taken from classroom teachers to deal with parent complaints
regarding dropping scores. In almost every case, low testing precision
was the cause.
- Parent anger towards school and teachers because of stagnant or dropping
test scores. The test gives high achievers dropping scores more often than low achievers. Yet the higher performing the student, the more likely parents are to be angered by dropping scores.
- Incorrectly tracking students due to significant errors caused by
low testing precision and low testing accuracy.
- Shifting emphasis of education from real accomplishments and high
cognition, to test preparation and low level skills. This may result in lower interest and performance for all students, especially high achievers.
- Emphasis, resources, and focus drawn away from average and high achieving students in order to raise the test scores of low scoring students.
- Distraction from the real needs of students including study skills,
hearing & vision problems, emotional and social problems, and other
factors that significantly affect performance.
- Self-fulfilling low expectations for low scoring students.
The costs and risks of MAP Testing are very high, especially for the
high achieving students. In just two years of testing, most of the risks
identified here occurred within our school. Considering that the costs
and risks of testing are so high, what were the identifiable benefits?
- Teachers received help determining the knowledge and skill levels
of the lowest performing students.
We were not able to identify any other clear benefits of testing. In
fact, the only success story other schools told us about MAP testing was that it helped teachers identify the academic needs of the lowest performing students. For a school whose population is average or higher,
the costs and risks of MAP testing do not justify the benefit of the test.
- 5th grade: addition, subtraction, & division
- Synthesis & evaluation: collect and organize information,
evaluate its implications
Cognition Notes Summation
As in most real-life situations, only low level knowledge is required to understand this problem; what is needed is critical thinking using that low level knowledge. Yet standardized testing reinforces curriculum decisions that promote increasing knowledge instead of developing critical thinking.
We strongly suspect that most of the problems discussed here apply to all
standardized tests. MAP was the only test that we had time and resources
to evaluate. We encourage others to perform similar evaluations on any
tests that they are using. Please let
us know of your results.