Problems with the Use of Student Test Scores to Evaluate Teachers

All over the country, policymakers are calling for systems that tie teacher evaluation to student performance. And from Florida to Colorado, Maryland to Louisiana, they are defining student performance as standardized test scores.

Few would argue that current teacher evaluation systems are adequate. And using standardized tests seems a cost-effective way to define performance, an important consideration in times of fiscal crisis. But are evaluation systems based on those tests valid? Can student performance on standardized assessments accurately identify effective teachers?

A new brief from the Economic Policy Institute reviews the evidence. The conclusion? Taken alone, student test scores are not a valid or reliable indicator of teacher effectiveness.

The brief discusses the lack of evidence that test-based accountability improves student learning, statistical concerns with using standardized tests to evaluate individual teachers and practical concerns with systems that do so, including the difficulty of attributing learning gains to one individual. It also raises concerns about the unintended consequences of these systems, including a narrowing of the curriculum to focus on tested content and a decreased interest in teacher collaboration in a context of individual rewards.

What I found most interesting was their review of the evidence on value-added modeling (VAM). VAM measures student growth after considering prior achievement and demographic characteristics. Some advocates claim that such models eliminate a number of concerns associated with test-based evaluation systems and could therefore serve as the basis for them.

But as this brief points out, a number of highly respected research institutions, including RAND and representatives of the National Academy of Sciences, caution against using VAM in high-stakes decision-making. One study cited found that of teachers who ranked in the bottom 20% in terms of effectiveness one year, less than a third ranked in that category the following year. And a third moved up to the top 40%. The pattern held for teachers ranking in the top 20% as well--only a third remained in that category the following year. Another third moved down to the bottom 40%. A separate study found that only about 4% to 16% of the variation in a teacher’s value-added ranking in one year can be predicted from his or her ranking the previous year.
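The churn described above follows almost mechanically from the cited 4%-16% figure. A minimal simulation sketch makes this concrete; it is purely illustrative, assuming an R-squared of 0.10 (the midpoint of the cited range), normally distributed scores, and a shared stable component plus independent yearly noise. It is not the actual data or method of the studies.

```python
import random

random.seed(0)

N = 10_000    # hypothetical number of teachers (illustrative)
R2 = 0.10     # assume 10% of variance predictable year to year
              # (midpoint of the cited 4%-16% range)

# Desired year-to-year correlation implied by that R-squared.
rho = R2 ** 0.5

# Each year's score = weighted stable component + independent noise,
# with weights chosen so corr(year1, year2) == rho and variance == 1.
w_stable = rho ** 0.5
w_noise = (1 - rho) ** 0.5

stable = [random.gauss(0, 1) for _ in range(N)]
year1 = [w_stable * s + w_noise * random.gauss(0, 1) for s in stable]
year2 = [w_stable * s + w_noise * random.gauss(0, 1) for s in stable]

def bottom_quintile(scores):
    """Return the indices of the bottom 20% of scores."""
    cutoff = sorted(scores)[N // 5]
    return {i for i, v in enumerate(scores) if v < cutoff}

b1, b2 = bottom_quintile(year1), bottom_quintile(year2)
stay = len(b1 & b2) / len(b1)
print(f"share of bottom-20% teachers still in the bottom 20% next year: {stay:.0%}")
```

Under these assumptions, only a minority of bottom-quintile teachers remain in the bottom quintile the following year, in the same ballpark as the churn the brief reports.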

What do these studies mean? Maybe a teacher's quality varies significantly over time. Or perhaps value-added models are measuring something other than teacher quality. A third study cited in the brief found that a student’s fifth grade teacher was a good predictor of his or her fourth grade test scores. Interesting, given that the teacher had no control over the student’s performance that year.

So what does all this mean? That we should never consider student performance in teacher evaluations? That we should resign ourselves to current systems of evaluations, which most would agree are not that great?

Absolutely not. As the authors point out, there does not have to be a dichotomy when it comes to teacher evaluations. There are a number of performance-based evaluation systems that can determine whether teachers are meeting professional standards, such as peer review models.

But this report will hopefully give some policymakers pause as they rush to develop new evaluation systems. While we can all agree on the urgency of making sure our students have good teachers, we cannot overlook glaring faults in a new system just for the sake of doing something new. We must take the time and resources necessary to develop a system that will truly ensure that all students have access to the quality teachers they deserve.


I think you skipped a step

I think you skipped a step when you went to, "There are a number of performance-based evaluation systems that can determine whether teachers are meeting professional standards, such as peer review models."

The report specifies that value-added ALONE (and mostly value-added based on a single year of data, which is a bit misleading) is not dependable for high-stakes decision-making. That's not what the systems you attack are constructed to do. Most of the time, value-added is just one measure among many, frequently alongside peer review or other improved observation protocols as major data sources.

The true dichotomy is whether student test scores will be used at all. Peer review against professional standards alone means they are not, and leaves that wide fissure behind. Using value-added not as a driver but as an informant is the compromise.

This piece may not have come

This piece may not have come across as I intended. I was not intending to attack these systems. Rather, I hoped to point out evidence that suggests value-added modeling is not the panacea for improving teacher evaluation systems that some claim that it is.

You are right that the report specifies that VAM alone is not sufficient for high-stakes decision-making (as I state in the third paragraph). The four articles I linked to above--in Florida, Maryland, Colorado and Louisiana--describe models that looked to base half of a teacher's evaluation on student achievement (which I took to mean standardized test scores). But that means no other element could be weighted more heavily. Does what we know about the capability of VAM to accurately gauge teacher impact on student progress support such systems?

I do think value-added modeling has a place. And I disagree that the true dichotomy is whether student test scores will be used or not...There is a difference between basing an entire evaluation, half of an evaluation, a quarter of an evaluation, a tenth of an evaluation, etc, on student test scores. And I believe someone looking to gain support for an evaluation system that includes VAM would pick up supporters the further he or she travels down that ladder. Certainly, a segment of the population believes the measures are so unreliable that they should not be included at all, just as another segment believes they should count for everything. But like most things, the majority of people are likely in the middle.

And by the way, you must have greater confidence than I that the tests that teacher evaluation systems would be based on actually measure what we as a society truly believe are the important outcomes for student learning. But that is a whole other argument...

I feel that the true wall is

I feel that the true wall is whether or not teachers should be held responsible for student outcomes. This is the key struggle being fought with the unions and the key question in building better evaluation systems. Is "professional norms" enough? Is the surface appearance of a successful classroom enough? Do we know enough about what leads directly to success to use process rather than outcomes for assessment? Should the professional goals of teachers be defined by process or by outcomes?

Once we get past the conversation about whether or not student outcomes have a place in the evaluation story, we can then have the larger debate about how to measure these things. Value-added is clearly one important piece of the student outcomes story. It's the best information we have consistently across the most teachers. That may be a sad fact, but it's true. It's worth recognizing that the same research that knocks value-added for its technical difficulties also shows that it is a far better predictor of student outcomes than nearly every other proposed measure of teacher quality (what school a teacher attended, degree attainment, years of experience, most evaluation systems).

There is a big difference between value-added itself being 100%, 70%, 30%, or 10%. No one knows the right number because we haven't been able to pilot these systems and because most places haven't seriously thought about what to build for the rest of their evaluation. But in the end, student outcomes MUST be a part of the picture, and value-added is one of the best metrics we have for that.

An interesting rejoinder to

An interesting rejoinder to your "Is 'professional norms' enough?" would be to ask why the question comes up vis-à-vis teaching at all. In what other profession are "professional norms" not enough? Do we have national tests to determine the health of each of a doctor's patients (on a given Tuesday in April, no less) to determine the quality of the doctor? Are lawyers' win/loss ratios published in the papers (or anywhere) to determine which of them are the best or worst? Are architects required to submit energy audits, user satisfaction surveys, or other conceivable measures of the quality of their construction every year to anyone?

At this point one may argue that the services of each of those professions noted above are privately procured and not "funded by my tax dollars." Which could lead into other, very interesting discussions about capitalism and the nature of professions, which I won't go into here. But I would point out that many of those doctors treat patients on Medicare and Medicaid (a.k.a. "my tax dollars"), many of those lawyers are public defenders or engage in pro bono work (ostensibly for the public good), and all those public buildings didn't get there in a haphazard way.

Especially in light of recent discussions on this site about the work of Dan Ariely and Dan Pink, we need to think about what we want the "profession" of teaching to look like--is it about experts who have the autonomy to exercise mastery in pursuit of a higher purpose? Or is it a routinized skill that can be taxonomically delineated and routinely performed? And which end of that spectrum leads us to better "outcomes" (however defined)?

[I think that "however defined" leads to the real rub...as a culture we don't agree about WHAT should be known, valued, appreciated and understood, let alone HOW to do those things. To presume that we can leave those undefined variables and have the rest of the equation work itself out keeps leading us back to some of the thorny conundrums we keep discussing instead. Which may sound like an argument for national standards, which it may be...but on that score, I think what we all (each) come back to is, to paraphrase George W. Bush, "If this were a dictatorship, it would be a heck of a lot easier, just so long as I'm the dictator."]

Given that VAM has high error

Given that VAM has high error rates, unstable ratings, false positives (the 5th grade teacher raising 4th grade scores), and insurmountable challenges trying to identify, let alone control for all of the factors that contribute to scores in such a small sample, I take no solace in the idea that some people would use it for smaller portions of evaluations. To acknowledge its shortcomings and then try to set an arbitrary percentage for its continued use in evaluations is illogical.

The valid use of state tests might be to look for evidence of what's happening in a school. Trends and aberrations in test scores might be an indication to look more closely at what's happening. But test scores are going to vacillate no matter what if you use enough of them over a long period of time. There's a lot of speculation involved if you read too deeply into them.

Anne makes an excellent point

Anne makes an excellent point when she notes that faith in VAM results implies faith in the validity and usefulness of the tests used. At the moment, the very tests that were criticized for being low-bar and uneven, state to state, are now being positioned as "scientifically based" measurement tools through which we can identify bad teaching. Ironic.

I don't think anybody in this national discourse believes that annual testing will--or even should--go away. The core question is around how we use the test results. Assessment should inform instruction, serving as benchmarking for student progress and a guide for what's needed next. Using assessment data to punish or shame teachers is counterproductive.

VAM--a currently flawed methodology-in-development--might become a useful tool. But that implies expertise in translating results into a public understanding of complex statistical modeling. Policymakers do not understand the limitations of VAM. Journalists do not understand the limitations. Education leaders don't understand the limitations, either. Using VAM to evaluate and improve teacher performance is like using a jackhammer when you really need a scalpel.

This is a tiresome argument,

This is a tiresome argument, and not one that helps. This whole thing is based on acceptance of the teach, drill, test paradigm of the industrial-model school. If we are to prepare students for this century, a change in perspective is needed. Assessment must take the form of classroom-based formative student self-assessment with supportive teacher feedback, using criteria and rubrics developed by students and teachers together. Most people regard assessment as an entity unto itself, whereas it should instead be an important learning tool for and by the students. It is NOT best used to 'take the students' temperature' (or as a 'weed-out' device for teachers) but as a means for learning. Yes, it will show the students' progress, and for those who doubt the teacher's veracity, outside assessors could evaluate student work.

And for those of you who think this won't work, we've had an action research group doing it since 2001 (on their own nickel - no grant!). The teachers have found higher achievement, intrinsic motivation, self-directed learning, higher-order thinking, more productive habits of mind, and creativity in their students' work. Grades are determined by consensus of teacher and students, and the teachers say that the only point of contention is when students rate themselves lower on the rubrics than the teacher does. When the work is student-centered with strong student involvement, students are quite frank and honest about their work.

To 'bring this to scale' (new terminology!), a large public engagement would be required to convince politicians, the public, and especially educators that this is a necessary change in education. And the professional development necessary for educators would be huge. But it would be the way to prepare our students for the 21st century.
However, it couldn't be done by the next election cycle, so politicians will dismiss it as airy visionary thinking instead of the tough stance they take in order to 'stand tall' and 'view with alarm' (or point with pride to the outliers that raise test scores ). (That makes me wonder, what would it be like if the Education Department had someone running it who actually was a teacher?)

I couldn't agree more with

I couldn't agree more with Mel. No evaluation system based on testing will work. The briefing paper also points out many negative consequences associated with evaluation systems tied to test scores, even VAMs. They include:

Discouraging teachers from wanting to work in schools with the neediest students
Discouraging teachers from wanting to work with special needs students
Creating disincentives for collaboration
Narrowing the curriculum (group incentives tend to make the narrowing even worse)
Identifying the wrong teachers for reward or discipline
Destroying morale
Providing no guidance for improvement, since an inaccurate measure can't tell you what to fix
Pushing teachers out of the profession
Failing to accomplish the goal of identifying effective and ineffective teachers for reward and discipline

I fear we are headed down this path with so much political momentum that we cannot stop it. It is sad, and it is a policy direction that will harm a whole generation of students.
