Testing Precision & Accuracy Problems

supporting detail

This page is a collection of detailed explanations to support the page evaluating MAP testing and the discussion of how implementing testing lowers standards.

( 12/01/08 )


Accuracy Considerations
A guide for choosing a test
 

Purpose vs outcomes of standards

The standards movement started with good intentions. Business leaders noticed that students were coming out of high school with plenty of knowledge but few practical skills. Graduates could not take on major projects, and they could not balance independent initiative with cooperation. College professors noticed that students were dependent learners: they failed to ask critical questions, limited their learning to the course outline, and learned only what was required for the test. Citizens' groups noticed that recent graduates could not interpret the news or the numbers in it; people could not estimate figures related to their own accounts or understand common numerical information in the news. So various groups set out to create standards.

The standards were intended to improve real performance in real life circumstances. People needed to learn to work cooperatively, be more self-aware, integrate complex information, solve non-routine problems, and evaluate the quality of diverse information. These were the intentions of the standards movement.

However, in a short period of time the standards movement became political. Committees set up by government agencies created detailed lists of knowledge and skills that students should be expected to learn. The knowledge and skills were reduced to their elemental parts and rooted in tradition. State curricula were created to define when, where, and how each of those skills would be taught. Standardized tests were created to measure the collection of knowledge and skills. The federal government passed a regulatory program (NCLB) mandating that all students be able to pass those tests and stipulating that school funding would depend on the results.

In the end, schools modified their strategies to fill students with the required skills and knowledge. Schools reduced their emphasis on critical thinking, project-based learning, cooperative learning, and subject integration. They increased their emphasis on test-taking skills, basic knowledge, and rote skills. Under state and federal mandates, schools all over the nation have been reducing their focus on high-level achievement so that they can invest their energies in preparing students for high-stakes tests. The standards movement, under political mandates, ended up promoting the very problems its originators had intended to solve.

Multiple Measures

Achievement and academic potential cannot be reduced to a single number and still carry much meaning. Every number we use has its own limitations; no number tells the whole story. We can review some common measures and their pitfalls:

  • Pass Rate: This is a measure of how many students are passing, or conversely not failing. Pass rate estimates only the reduction of failures. It provides no information about high achievers, or even average students. Yet this is the measure emphasized by school testing programs. Emphasizing it tends to encourage schools to invest their resources in the needs of the low achievers and structure their programs around the lowest learners at the expense of high achievers.
  • Average Score: The average tells us something general about the group as a whole, but it can be very misleading. An increase in the average can result from an increase in one large subgroup even as other subgroups remain level or decline. Average scores can be manipulated by addressing the needs of the easiest-to-change subgroup while providing very little support to the others (see the sketch at the end of this section).
  • Median Score: This measure tells us only about the middle. The low performers and high performers can experience great changes, either positive or negative, without ever changing the median.
  • Growth Score (average or median): Although this number may provide more meaningful information than the preceding scores, it still fails to identify the specific needs of the low or high students, or of any other subgroup.
  • Percentile Scores: A chart of scores instead of a single number provides more information, but like all the scores listed so far it is still a one-dimensional measure. All of these measures reduce learning to a single dimension, typically using tests that sit low on Bloom's Taxonomy. Percentile scores tend to lead to ranking students instead of identifying their specific needs, and ranking students frequently results in low expectations for the low scorers.
  • Individual IQ: The Intelligence Quotient is an outdated attempt to measure an individual's cognitive potential. It offers some improvement over regular test scores in that IQ tests are designed at higher cognitive levels than most tests, but it still errs in reducing cognitive performance to a single number. Each test contains internal biases: for example, most IQ tests strongly emphasize reasoning about spatial patterns but ignore the ability to write coherently or to achieve quality outcomes in team settings.

Attempts to reduce measures of cognition, learning, and education to simple numbers can prove very counterproductive. The information omitted is always greater than the information measured. The information that proves most important may never get measured, may get measured incorrectly, or may get diluted as the various measures are all averaged together.
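To make these pitfalls concrete, here is a minimal sketch with made-up scores (not real data) showing how a group's average can rise while the median is unchanged and the lowest performers actually decline:

```python
# Hypothetical scores for one class, before and after an intervention
# that helps only the easiest-to-move subgroup (illustrative numbers only).
before = [45, 48, 50, 60, 62, 64, 70, 72, 74, 90]
after  = [42, 44, 46, 60, 62, 64, 85, 88, 90, 95]   # low scorers slip, top rises

def average(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

print("average:", average(before), "->", average(after))             # rises
print("median: ", median(before),  "->", median(after))              # unchanged
print("lowest three:", sorted(before)[:3], "->", sorted(after)[:3])  # declines
```

A single reported average rising from 63.5 to 67.6 would look like across-the-board improvement, even though the bottom of the class lost ground and the median never moved.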

Accuracy error discussion

 

 

 

IQ vs. Multiple Intelligences

Acceleration through low level skills vs. high achievement

The word "level" creates much confusion in education. Level can mean grade level or knowledge level, a reference to what knowledge and skills students should have by a given grade. Level can mean cognitive level, typically a reference to Bloom's Taxonomy. Or level can mean achievement level, which is similar to cognitive level but focuses on real outcomes.

School administrators and teachers frequently confuse the various definitions of level. But knowledge level has nothing to do with cognitive level or achievement level. Measures of knowledge level, such as the RIT scores, focus on the two lowest cognitive levels, and ignore the highest cognitive levels. Complex thinking and achievement are totally ignored in grade level measures.

Students can work at a very high cognitive level while still demonstrating a low knowledge level. Students can demonstrate amazingly high achievement while retaining very little knowledge. But in most schools, particularly those that use standardized tests to guide curriculum, students are expected to race through knowledge while not being provided time to attain high cognition or high achievement.

This is the distinction between accelerated learning and high-level learning. Accelerated learning is usually low in both cognition and achievement: test scores may be high, and students learn information usually reserved for higher grade levels, but their ability to use that information wisely is rarely supported. In high (cognitive) level learning, students invest great energy in building understanding, reasoning through information, and evaluating and integrating it. They develop their ability to think, but they may fail to memorize the facts needed to pass tests.

Test-based programs emphasize accelerated learning over high achievement.

This creates a problem for high-achieving students placed in accelerated programs. Curriculum for core classes, such as math, is designed for the average student at a given grade level. State and federal laws require that the class be assessed using tests designed to identify underachieving students. So placing high-achieving students in accelerated classes means they are instructed according to the needs of average students and assessed according to the needs of underachieving students. This tends to produce high test scores, but superficial learning. Students get high scores with low standards, so that nobody will risk getting low scores with high standards.

 

Standards for High Achievers Vs Standards for Low Achievers

The question then arises: should the standards for high achievers be the same as the standards for low achievers, only delivered faster?

Programs for high achievers should involve deepening their thinking. Students should be given non-routine problems that require reasoning and demand their own search for new knowledge to apply to the problem. They should be required to communicate what they have discovered or created and how they overcame challenges along the way. They should be able to discuss the similarities and differences between the problem they solved and other real-life problems. This perception of achievement is embodied in the ELS Design Principles, the NCTM Standards, the AAAS Project 2061 goals, and the inquiry-based learning philosophy. There is no straightforward means to test high achievement.

Programs for low achievers usually involve very specific lists of knowledge and skills that must be learned. The information is sequenced from easy to harder. Periodic assessments are performed to ensure that the knowledge is being learned, and standardized tests are used to measure how much of it was actually retained. This is probably not the best method for teaching low achievers; it is used simply because it is easy to measure and reinforce.

Research has shown that most teachers teach to the needs of their low achievers. When I started at my school, we were all encouraged to use the methods for high achievers with all students. Using the expeditionary, inquiry-based model, we achieved outstanding results with the top half of the class but had doubtful results with the bottom half. In response to this challenge we hired a curriculum coordinator. Within two years' time, we had switched to using the methods designed for low achievers with all students. This improved test scores, but it eliminated the phenomenal successes for the high achievers.

 

Test Results: Information vs. Measure

We test because we need information. School administrators need to know what specific goals to focus on. Teachers need to know what skills to remediate and how fast they can move through material. Students need to be told what specific concepts to review and how to improve. The goal of testing is to provide this information.

However, the methods we use to grade tests undermine this very goal. We grade tests by scoring them, by returning a number. What does that number tell us? Using a single number to score tests washes out the very information that we need. The state may tell us that a student scored a 612 on the EOG, or NWEA may tell us that the student earned a RIT score of 231. But what exactly does this mean?

The number does not tell us what specific skills the student did not understand. It does not tell us what caused the student's difficulty, and it does not tell us what should be changed. These test results provide almost no useful information to guide our decisions about future instruction.

This is the problem with using test scores to drive school improvement. The scores do not provide the information that educators need to improve student knowledge, comprehension, and achievement. Test scores can even discourage educators from distinguishing between knowledge, comprehension, and achievement.

To guide improvement, educators need detailed information, not just scores. The information gathering needs to ask, "What appears to be the greatest problem?" and "What appears to be the cause of the problem?" These questions need to be asked for individual learners as well as for the group. Current testing methods do not provide this type of information.

 
 

Analogies to test accuracy

We can demonstrate the problem with test scores using analogies. If you are not a teacher, imagine that your company wanted a quantified measure of your work. Do you think it could create an assessment that successfully ranked each worker with a single number? Here are some analogies:

  • Hospitals and Nursing: Under NCLB, schools are required to count how many students don't fail, and the schools are judged accordingly. By analogy, we might judge hospitals by counting the number of patients who don't die. Would this provide valid, meaningful results? A maternity hospital would have a much higher success rate than a dialysis hospital or an elder-care facility, but should we conclude that the maternity hospital is actually better and more worthy of funding? A hospital in a rich neighborhood, where the emergency room specializes in problems such as tennis elbow, would get higher scores than a hospital in an inner-city neighborhood, where the emergency ward deals with stab wounds, drug overdoses, and AIDS. Can the scores really provide a valid comparison?
  • Political Districts: Many studies over many decades have shown that test scores correlate strongly with the socio-economic status (SES), or real estate values, of the neighborhood. So it makes as much sense to grade politicians by how many of their citizens are in poverty as it does to grade schools by how many students fail. But is this measure accurate? Is a politician in a wealthy district always better than one in a poor district? How much control does a local politician have over the poverty rate in his district? How much control do schools have over the poverty or family issues of their students?
  • Your Job: What single, simple number is the easiest measure of passing work at your job? Would measuring that single number constitute a valid assessment of your work? Would measuring changes in that number constitute a valid measure of your growth, or might it capture factors outside of your control? Would it be just to base your funding on that one number?
 
 
Precision Considerations
 

Standard Deviation, r-value, and differentiating instruction

In the context of testing, standard deviation (sigma) describes how much measured scores tend to vary from their true values (what test makers call the standard error of measurement). To use any measure to make reliable decisions about individuals, the standard deviation must be small compared to the distinctions you wish to draw. How small it must be depends on the quality of measure that you need.

As a rule of thumb, you want twice the standard deviation (2 sigma) to be smaller than the distinctions you need to make. For example, suppose you are creating three math groups: above grade level, at grade level, and below grade level, and you want to use test scores to place your students into the groups. If your grade-level range is 20 points and your precision estimate (1 sigma) is 3 points, then you still run a real risk of placing any student who scores within 6 points of a dividing line into the wrong group.
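As a rough illustration of that rule of thumb, here is a minimal sketch (assuming normally distributed measurement error with sigma = 3, matching the example above) of the chance that a student's measured score lands on the wrong side of a cut score, depending on how far the student's true score sits from the cut:

```python
from math import erf, sqrt

SIGMA = 3.0  # assumed measurement error in points, as in the example above

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Probability that the measured score falls on the wrong side of a cutoff
# when the student's true score is `distance` points away from it.
for distance in [0, 1, 3, 6]:
    p_wrong = 1.0 - normal_cdf(distance / SIGMA)
    print(f"true score {distance} points from the cut: "
          f"{p_wrong:.0%} chance of landing in the wrong group")
```

A student sitting right on a dividing line is essentially a coin flip; even one sigma away, roughly one student in six will be placed in the wrong group on any given test.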

In the high-precision demo graph, 5 students who should be in the low group are at risk of being placed in the middle group, 10 from the middle are at risk of being placed in either the low or high group, and 5 from the high group are at risk of being placed in the middle.

In the above high precision graph, 20 out of 40 students are at risk of being placed in the wrong skill level group.

Below we demonstrate a graph of a low-precision test. Each grade level spans 20 points, but the standard deviation of the test is 8 (2 sigma is 16). Every single student is at risk of being tracked into the incorrect group for his skills. Students in the middle have fair odds of being placed in either the slow group or the accelerated group.

Although this demo is labeled "Low Precision," it actually demonstrates higher precision than MAP testing does for middle school and high school.
In this example it is more likely than not that at least one middle student will be tracked low, at least one middle student will be tracked high, at least one high student will be tracked middle, and at least one low student will be tracked middle. Do you really want your child's education to be tracked based on imprecise test scores?
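The demos above can be reproduced with a small simulation. This is a hedged sketch, not the author's original demo: it assumes 40 students with true scores spread evenly across three 20-point bands, adds normally distributed measurement error, and counts how many students land in the wrong group at each level of precision:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: three ability bands of 20 points each (0-20, 20-40, 40-60),
# with 40 students spread evenly across the full range.
true_scores = np.linspace(0.5, 59.5, 40)
cutoffs = [20, 40]

def band(scores):
    """Return band index 0/1/2 for each score, given the cutoffs."""
    return np.digitize(scores, cutoffs)

for sigma, label in [(3, "'high precision' demo"), (8, "'low precision' demo")]:
    misplaced = 0
    trials = 10_000
    for _ in range(trials):
        measured = true_scores + rng.normal(0, sigma, size=true_scores.size)
        misplaced += np.sum(band(measured) != band(true_scores))
    print(f"{label}: sigma = {sigma}, on average "
          f"{misplaced / trials:.1f} of 40 students land in the wrong group")
```

The exact counts vary from run to run, but the average number of misplacements grows sharply as sigma rises from 3 to 8, and any individual student near a boundary becomes close to a coin flip.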
Related pages and figures: Standard Deviation overview; 6 sigma philosophies; MAP standard deviation; MAP Precision; Test Error Decision Trees.

Expected Growth

When reading a graph, a person should have an idea of what values make sense and what values are too extreme to make sense. When evaluating student growth over half a year, you would expect to see a graph something like this: very few scores declining, and most scores within the range of a normal half year's growth.
 
But when our school used MAP, all of the graphs were similar to the one below. Many scores were above a full year's growth, and many showed declines of more than half a year's growth. These extremes indicated precision problems intrinsic to the test. So many scores outside the reasonable range should have prompted a discussion of reliability, and teachers should have been advised not to rely on the data.
 
For a validation check we can simulate the precision we are offered. For some of our tests, the expected growth was about 3 points, but the standard deviation was also 3 points. In the graph below we see the tested growth we would expect from a grade of 42 students where every single student's "real" improvement was 3 points. Even though each student grew by three points, 10 scores decline, 9 are more than twice the real growth, and only 9 scores come close to the correct growth of 3 (between 2 and 4).
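A rough version of that simulation is sketched below. It is an assumption-laden illustration, not the author's original calculation: every one of 42 students is assigned a true growth of exactly 3 points, and normally distributed noise with a standard deviation of 3 points is added to each measured growth score:

```python
import numpy as np

rng = np.random.default_rng(1)

N_STUDENTS = 42
TRUE_GROWTH = 3.0   # every student really grew 3 points (assumed)
SIGMA = 3.0         # assumed noise in the measured growth score

measured = TRUE_GROWTH + rng.normal(0, SIGMA, size=N_STUDENTS)

declines   = np.sum(measured < 0)                       # look like negative growth
doubled    = np.sum(measured > 2 * TRUE_GROWTH)         # look like double the real growth
near_truth = np.sum((measured >= 2) & (measured <= 4))  # within a point of the truth

print(f"out of {N_STUDENTS} students, all of whom truly grew {TRUE_GROWTH} points:")
print(f"  {declines} measured scores show a decline")
print(f"  {doubled} measured scores show more than twice the real growth")
print(f"  {near_truth} measured scores land near the true growth (2 to 4 points)")
```

The exact counts change with each random draw, but the pattern is stable: when the noise is as large as the expected growth itself, a sizable fraction of students appear to decline even though every one of them improved.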
 
 
Similarly, perfectly precise strand data would have looked something like this: growth confined to a reasonable range, and r-values in the high 0.90s.
 
Instead, with MAP, strand scores looked like this: a scatter plot in the worst sense of the term. All of the correlation coefficients were lower than 0.35. No decisions about individuals should ever be made from test data when the scatter looks this bad.
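Low r-values like these are exactly what heavy measurement noise produces. The sketch below is a hedged illustration with invented numbers, not MAP data: each student's fall and spring strand scores are generated from a true ability plus independent noise, and the fall-versus-spring correlation is computed as the noise grows relative to the real spread between students:

```python
import numpy as np

rng = np.random.default_rng(2)

N_STUDENTS = 42
true_ability = rng.normal(220, 5, size=N_STUDENTS)  # assumed spread of true strand scores

for noise_sd in [1, 5, 10]:
    fall   = true_ability + rng.normal(0, noise_sd, size=N_STUDENTS)
    spring = true_ability + 3 + rng.normal(0, noise_sd, size=N_STUDENTS)  # true growth of 3
    r = np.corrcoef(fall, spring)[0, 1]
    print(f"noise sd {noise_sd:>2} vs. true spread of 5: fall-spring r = {r:.2f}")
```

When the noise is small relative to the real differences between students, r sits in the high 0.90s; when the noise is twice the real spread, r collapses toward the range the strand plots showed.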
 
In our school, all of the middle school math and science teachers could tell from the graphs alone that the data was unreliable. Others could not see the precision problem.
MAP growth precision

Random Chance Factors in Bubble Tests

Suppose a student takes a multiple-choice test and guesses on 10 questions that he does not understand. His test score will reflect his true knowledge precisely only if every single guess is wrong. There is only an 11% chance that all ten guesses will be wrong, but a 12% chance that 4 or more guesses will be right. If the student guessed on 20 questions, it is more likely that he guessed 8 or more correctly than that he guessed none correctly. The more students guess, the lower the precision, the more the scores vary from true, and the less reliable the scores become.
On a multiple-choice test with 5 choices per question, a person who guesses 10 times will most likely guess correctly 2 or 3 times, but still has a non-negligible chance of guessing correctly 5 or more times. Guessing 20 times, they will most likely guess correctly 3 to 5 times, but still have a non-negligible chance of guessing correctly 8 or more times. As guessing increases, the reliability of the test score decreases.
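These probabilities come straight from the binomial distribution. Here is a minimal sketch, assuming pure random guessing with a 1-in-5 chance per item, that reproduces the figures above:

```python
from math import comb

def binom_pmf(k, n, p=0.2):
    """Probability of exactly k correct guesses out of n, with chance p each."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_at_least(k, n, p=0.2):
    """Probability of k or more correct guesses out of n."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

print(f"10 guesses, all wrong:        {binom_pmf(0, 10):.0%}")       # about 11%
print(f"10 guesses, 4 or more right:  {binom_at_least(4, 10):.0%}")  # about 12%
print(f"20 guesses, all wrong:        {binom_pmf(0, 20):.1%}")       # about 1.2%
print(f"20 guesses, 8 or more right:  {binom_at_least(8, 20):.1%}")  # about 3.2%
```

Guessing on 20 items, a student is nearly three times as likely to pick up 8 or more correct answers by luck as to get none of them right.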
But this considers only random chance, and chance is not the only factor that lowers precision. Both cognitive science and common experience tell us that everyone has bad days when thinking and memory are not sharp. Bad days lead to scores lower than they should be. On the other side, some students are good guessers, and good guessing leads both to unrealistically high scores and to lower precision. Human factors combined with random chance lower testing precision, in many cases below the level needed to provide useful information.
MAP is a self-adjusting test: the more questions a student gets right, the harder the questions he is given. But the harder the questions, the more he must resort to guessing, and the more he guesses, the lower the precision of the test. The self-adjusting model for testing guarantees low precision.
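The effect of more guessing on precision can be quantified with the same binomial model. This sketch is an illustration under the assumption of pure 1-in-5 guessing; it shows how the spread of a student's lucky-guess bonus grows as the number of guessed items increases:

```python
from math import sqrt

P = 0.2  # assumed chance of guessing one item correctly (5 choices)

for n_guessed in [5, 10, 20, 30]:
    expected = n_guessed * P                 # average number of lucky hits
    spread = sqrt(n_guessed * P * (1 - P))   # standard deviation of that number
    print(f"{n_guessed:>2} guessed items: {expected:.0f} lucky hits on average, "
          f"give or take about {spread:.1f}")
```

Each additional block of guessed items adds both a higher expected bonus and a wider spread, so two students with identical real knowledge can end up several raw-score points apart purely by luck.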
 
 

Testing as a Statistical Process

Testing, at its best, is a statistical process used to inform a decision. Central to that process is the interpretation of errors, commonly known as Type I errors (false positives) and Type II errors (false negatives).

If we are using a multiple-choice test with 5 answers per question, the lowest odds of false positives are easy to determine. The student has 1 chance in 5 of guessing one question correctly, but only 1 chance in 125 of guessing three questions correctly. Thus, to limit false positives, at least three questions are needed for each assessed skill. Of course, good guessers can easily improve their odds of a false positive across three questions to 1 chance in 64 or better.
Estimating the rate of false negatives is not as easy. There is always a risk that a student who is strong in the skill being tested will pick a wrong answer for some other reason: misreading the question, a silly arithmetic error, mismarking the answer, and so on. The odds of false negatives are lower than the odds of false positives, but they are still a very real part of testing.
The MAP test technical manual shows a theoretical curve relating the odds of a student answering a question correctly to the student's skill level. The curve has one very obvious error, and identifying that error will help us notice other, more subtle errors.
The curve shows the odds of a student with no knowledge answering correctly to be 0%. As discussed above, elementary statistics says the real odds under random guessing are 20% (1 chance in 5). If the interpretation of a test were actually based on the theoretical model in the graph, it would give misleadingly high results for all students whose skill level falls below the point where the curve reaches a 20% chance.
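The contrast can be illustrated with two standard item-response curves. The sketch below is not taken from the MAP manual; it compares a plain logistic curve with no guessing floor (the shape described above) to a three-parameter-style curve whose lower asymptote is the 20% chance of a blind guess:

```python
from math import exp

def p_correct(skill, difficulty, guess_floor=0.0):
    """Logistic item-response curve; guess_floor is the lower asymptote."""
    logistic = 1.0 / (1.0 + exp(-(skill - difficulty)))
    return guess_floor + (1.0 - guess_floor) * logistic

DIFFICULTY = 0.0  # arbitrary item difficulty on the skill scale

print("skill   no floor   20% floor")
for skill in [-4, -2, 0, 2, 4]:
    no_floor   = p_correct(skill, DIFFICULTY, guess_floor=0.0)
    with_floor = p_correct(skill, DIFFICULTY, guess_floor=0.2)
    print(f"{skill:>5}   {no_floor:8.0%}   {with_floor:9.0%}")
```

Below the item's difficulty the two curves diverge sharply: the no-floor curve predicts near-zero success for weak students, while blind guessing alone keeps the real success rate near 20%, which is exactly the discrepancy pointed out above.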

But there are more subtle problems with the theoretical model in that graph. Good guessers improve their odds by eliminating answers that are most likely wrong, even without knowing which answer is correct from the skill being tested. For them, the curve of correct-answer odds versus skill is somewhat stepwise, with their scores always tending to run higher than the theoretical model predicts.

But the greatest problem comes from realizing that people frequently misunderstand information. When a person has learned just enough new information to misunderstand it, he is more likely to choose a wrong answer than if he had simply guessed. For example, when students first learn the arithmetic of integers, they frequently interchange the rules for addition, subtraction, and multiplication. Because of this confusion they choose wrong answers more often than they would by guessing. Thus, a person with beginning, partial knowledge can tend to score lower than a person who, having no knowledge at all, just guesses!
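That below-chance effect is easy to express numerically. The sketch below is an invented illustration, and the probabilities are assumptions rather than measured data: a confused student is drawn toward the misconception distractor on items that trigger the misconception, while a pure guesser picks uniformly at random:

```python
# Assumed, illustrative probabilities for one 10-item skill strand
# (5 answer choices per item).
N_ITEMS = 10
P_GUESS = 1 / 5        # blind guesser: uniform chance of the right answer
P_CONFUSED = 0.05      # confused student: usually drawn to the misconception distractor

expected_guesser  = N_ITEMS * P_GUESS
expected_confused = N_ITEMS * P_CONFUSED

print(f"blind guesser:    about {expected_guesser:.0f} of {N_ITEMS} items correct")
print(f"confused student: about {expected_confused:.1f} of {N_ITEMS} items correct")
```

Under these assumptions, the student who has just begun learning the material scores below the one who knows nothing at all.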

A well designed test will report when such patterns exist. Although I have read many examples of misunderstanding leading to lower scores than random chance, I have never yet heard of a test report that actually identifies such patterns.

Statistical & Learning Errors

High Achievers vs. Outliers

Faith in testing frequently results from the innumeracy of those promoting it; promoters of testing typically do not understand the numerical concepts behind the tests.

In response to testing reliability reports I made to our committee, our curriculum coordinator e-mailed me the following response: "Karl, I haven’t digested this fully, but reading the highlighting I do recall from my psychometrics text that all standardized tests correlate to negative growth for high achievers, simply because they are the out-liers and tend to regress toward the mean." The various errors in this statement are worthy of serious discussion. This statement demonstrates serious innumeracy, a habit of learning low on Bloom's Taxonomy (memorizing without understanding), and extremely low expectations for high achievers.

The first problem is the failure to distinguish between a valid outlier (an extreme score) and a large testing precision error (a doubtful score). For a real outlier (e.g., a high achiever), repeated testing should produce approximately the same high score. High achievers are not statistical accidents; their scores have no natural reason to regress toward the mean.

Any student, regardless of skill level, can have a day when his test score is an outlier for his own personal skill level. That personal outlier can be either low or high. On subsequent tests, his scores will tend to regress toward his own personal mean, not the mean of the group.
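This distinction can be shown with a small simulation. It is a hedged sketch with invented numbers: each student has a fixed true score, every observed score is that true score plus independent measurement noise, and we look at where repeated scores settle:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed true scores: one genuinely high achiever and two typical students.
true_scores = {"high achiever": 250, "typical A": 220, "typical B": 215}
NOISE_SD = 4        # assumed measurement error per test
N_RETESTS = 1000    # many repeated tests of the same students

group_mean = np.mean(list(true_scores.values()))
print(f"group mean of true scores: {group_mean:.0f}")

for name, true in true_scores.items():
    observed = true + rng.normal(0, NOISE_SD, size=N_RETESTS)
    print(f"{name:>13}: true score {true}, average of repeated tests {observed.mean():.1f}")
```

Averaged over repeated tests, each student's scores settle around his own true score, not the group mean; an unusually high or low single score regresses toward that personal mean on the next test, while the genuinely high achiever stays high.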

The statement in the e-mail above demonstrates a failure to distinguish between a high achiever and a test error. It also demonstrates a failure to distinguish between expectations for individuals (regressing toward one's own personal mean) and expectations for the group (each individual tending to retain his relative rank within the group).

In fact, high achievers should not regress toward the group mean; they should move progressively farther from it. The very abilities, attitudes, and effort that made them rise above the rest of the group should continue to define their performance, creating a constantly growing difference between their performance and the group's. Like a driver going 60 mph among drivers going 40 mph, the faster driver will not fall back into the pack (regress to the mean); he will continue to pull farther ahead. Expecting high achievers to regress to the mean demonstrates both a failure to understand "regression to the mean" and a failure to understand high achievement.

The claim that high achievers will "regress to the mean" demonstrates extremely low expectations for high achievers. It suggests a belief that high achievers have no natural abilities, personal attitudes, effort, or background knowledge that leads them to succeed. This attitude sees high achievers as flukes, mere testing accidents, who should be expected to fall back into the norm. There is no natural reason, intrinsic to high achievers, for their performance to fall. Such declines must result from either the test or the instructional methods.

So then, what is implied by the claim that MAP, and possibly other standardized tests, correlate to negative growth for high achievers? It could imply a few things. It might suggest that the makers of the test did not know how to measure high achievement, so that the high end of the test is full of errors and thus unreliable. But that would tend to make scores fluctuate wildly, not simply rise and then fall.

The negative growth instead implies that schools that depend on standardized testing to guide curriculum actually make choices that hurt high achievers: some aspect of testing, and of test-based instruction, undermines their success. This is not surprising, since most testing is directed at the needs of low achievers. It is therefore advisable to reject decisions based on standardized tests when scores fall in the negative-growth range.

 
Summation:

In these discussions we have seen that standardized testing has serious accuracy problems and serious precision problems. We have seen that standardized testing tends to lead schools to lower their standards, and that testing may actually lower the achievement of the top students. These problems are observed by many teachers, but they are never acknowledged by the politicians who pass laws requiring testing or the school administrators who implement testing in schools.

 
 
 
 

Footnote: Myers-Briggs Personality Types and Learning

In the Myers-Briggs profile, those who tend to be most proficient at math are introverted, intuitive thinkers (INTP and INTJ personalities). They readily understand big-picture concepts, intuit strategies to solve complex problems, create new approaches, and reflect on their own learning. They frequently demonstrate a rapid grasp of general concepts even while making unusual errors with the details. They learn new ideas in a seemingly random order, because the relationships between ideas drive their thinking.

Those who manage schools and design curriculum tend to have sensing-judging (SJ) personalities. They make lists of everything that must be learned. They make calendars scheduling when each detail must be learned. They prescribe the official order for learning, typically starting with small, easy details, building into more complex skills, and ending with application of those skills. While laboring to ensure that all the details are laid out in the right order, they frequently fail to support the very intent of the learning.

The SJs create curriculum designs that do not support the instinctive intuition of the INT personalities. These designs do not promote the learning and working styles that will be used in engineering and science, the careers that INTs tend to gravitate toward.

The NCTM Standards, the Expeditionary Design Principles, and the inquiry-based learning model were all designed to acknowledge the learning styles of intuitive introverts, as well as the goal of teaching students to do real work with the skills they are learning. In contrast, most states have set as their high school math standards a list of skills that will not support the vast majority of students in achieving real outcomes as adults.

Ref: David Keirsey & Marilyn Bates, Please Understand Me

 

 
