VAM Nauseum: Bleeding the Patient
In recent days, and even going back months (years?), I’ve been pestering John Fensterwald about certain details in the overall media coverage of teacher evaluation and value-added measures. Let me say up front that I know John personally from numerous encounters at various events, and I have great respect for the work he does on his blog, Thoughts on Public Education. I have faith in his objectivity and his sincere desire to understand the complete story on any issue, and even when I disagree with him, I appreciate what he brings to the discussion. That said, I have some comments below that might be seen as rather blunt, but I feel confident that they will be taken as intended, as a friendly disagreement among people of mutual good will.
On June 1, John wrote “Experiments in Evaluating Teachers” to provide some interesting developments on the issue of teacher evaluations in California. This topic is of great interest to me as a co-author of the ACT report on teacher evaluation (see Publications, above), as a member of a union negotiating team working on teacher evaluation, and as a teacher and parent in California public schools. I would recommend John’s post in its entirety, but I took exception to this part in particular: “At the same time, test results are an objective measure and can be a good diagnostic tool, telling a teacher which types of students are learning the fastest, and which teachers in a school are having the most success.” I left the following comment on John’s blog post:
I don’t think we all agree the tests are objective – their biases simply originate further away from the classroom. Can you back up the claim that the tests have diagnostic value? Can you specify what you mean by “success”? Because I think what you mean is that the tests show which teacher’s students do best on the tests, and I do not accept test results as a term equivalent to “success”.
Furthermore, I have mentioned several times to you offline, and on this blog, and on my blog, that the position of APA, AERA, and NCME is that tests validated for one purpose should not be held valid for other purposes, especially where high-stakes decisions come into play. No one ever seems to have an answer for that – they just go ahead and do it. When will you, or any reporter, ask TAP, or LAUSD, or even NEA how they get around these words of caution from the leading professional bodies in educational research and measurement?
(Just to clarify, NEA as a whole hasn’t adopted that recommendation, and the language of the recommendation seems to describe tests that haven’t been developed yet – perhaps looking ahead to the supposedly improved instruments coming with the Common Core adoption).
I’m all for more substantive, ongoing rigorous teacher evaluations, and would recommend anyone interested in teacher perspectives on improved evaluations should also look at a report I helped produce:
John wrote back to me as follows:
David: Not all of the recent evaluations on value-added metrics pan it. This from an April report of the Brown Center on Education at the Brookings Institution. Susanna Loeb of Stanford was a contributor. “We have previously issued a report that describes some of the imperfections in value-added measures while documenting that: a) they provide one of the best presently available signals of the future ability of teachers to raise student test scores; b) the technical issues surrounding the use of value-added measures arise in one form or another with respect to any evaluation of complex human behavior; and c) value-added measures are on par with performance measures in other fields in terms of their predictive validity.” In other words, there’s subjectivity in every form of evaluation, including classroom and peer evaluations. So test scores, whether CSTs or locally developed assessments, should be one of several measures. My gut says counting test results as 30 percent of an evaluation, which LA is proposing and the Obama administration has pushed through Race to the Top, is too high. I look forward to the new assessments under Common Core, probing deeper levels of learning, as an improvement to the CSTs. I sense you want to do away with the CSTs altogether. I strongly disagree.
As for using CSTs as a diagnostic tool, I refer to a presentation by LAUSD at the Ed Trust-West. The district hopes to use value-added results, now known as Academic Growth over Time, to answer these questions:
- Are students in a particular region, school, or grade level growing faster than students just like them throughout the District?
- Are specific groups of students in particular schools or classrooms growing faster or slower than the district average?
- And with further observation, what instructional methods, programs and interventions are working to improve student outcomes?
- What is the distribution of effective educators? Do we have the most effective educators working in the right places to achieve our goals?
- What can we learn from places where we are achieving remarkable results?
This is not a punitive approach. The results can inform teachers and the district.
What John has offered above boils down to three ideas I’ll agree with for the sake of argument:
- Test scores generated by a teacher’s students may be reasonable predictors of future students’ test scores.
- Other evaluation methods have similar weaknesses.
- Districts are using test scores for various purposes including teacher evaluation.
But still, as much as I respect John, I think he’s spinning his wheels here. First of all, the fact that LAUSD and other districts are going ahead with these plans is not evidence that the plans are well-designed, grounded in research, or able to produce results they can use in a valid way for their purposes. The bleeding of patients used to be widely practiced “medicine” – carried out by people who all agreed it was an effective treatment, but that didn’t make bleeding an effective treatment. I reiterate: no one seems to be addressing the NCME/APA/AERA policy on test validation; tests validated for student assessment are not necessarily valid for teacher evaluation. It may be true that researchers are finding ways to predict future test scores based on test scores, but that is not the same thing as proving the teacher is responsible for the test scores. I direct skeptics to the “false positives” found in Jesse Rothstein’s 2010 study of value-added measures. It turns out the VAM data can also be used to show that fifth-grade teachers affect fourth-grade test scores. Since we know that’s impossible, the data are showing us that there are unaccounted-for causes creating that effect.
In a prior blog post, I cited study after study arguing for all sorts of school variables affecting test scores. Those potential impacts on test scores can only be ignored for VAM-based teacher evaluation if you can prove that the effects are the same for all teachers at a school. Good luck with that. Those of us who work in schools can tell you how much variability there is from room to room and from year to year, despite the fact that you’re looking at a whole bunch of classes labeled “Grade 5” or “English 10.”
But there’s a deeper philosophical problem here, and it’s once again revealed in the language used in some of the research reports and political/bureaucratic documents. Look at some of the examples from above (with the caveat that I’m reacting to this information as presented and without having viewed the originals):
- The Brookings report talks about “performance measures” – they mean test scores, and test scores alone.
- The LAUSD document asks about “growing” – they mean growing test scores, and test scores alone.
- LAUSD wants to compare “student outcomes” – they mean student test scores, and test scores alone.
- The same document refers to “effective educators” – they mean effective at raising test scores, and test scores alone.
- LAUSD refers to “remarkable results” – they mean remarkable test scores, and test scores alone.
It would be difficult to overstate the philosophical objections I have to this approach to teacher evaluation, but my philosophy is firmly grounded in teaching realities that I think are misunderstood or overlooked by the other side in this debate. As I’ve written before, those who espouse the importance of standardized test scores reveal a fundamental misperception of my job, my students, and my school.
John suggested I might “want to do away with the CSTs altogether,” but I wouldn’t go that far. I accept that a school district could use California Standards Tests and other measures in a useful way to observe significant trends in schools in the district, or among certain groups of students; that’s what the tests are designed to do, and what I assume to be their valid use per NCME/APA/AERA guidelines. However, in their zeal to improve teacher evaluation, too many districts and reformers are trying to double down on the same limited tools they’re already over-invested in. When they do so, they risk creating an inferior evaluation system. John called attention to the portion of the Brookings report that says other types of evaluation have their flaws too. Let’s consider, for example, lesson observations as an evaluation tool. Their great advantage over test scores is that they can be tailored to examine every facet of teacher practice, they can be repeated, and they can be recorded and analyzed – all of which will engage participants in meaningful examination of professional practices.
In the book Instruction That Measures Up: Successful Teaching in the Age of Accountability, W. James Popham offers up this summary based on his decades of work in the field: “Because of the inherent particularism enveloping a teacher’s endeavors, I believe the evaluation of teaching must fundamentally rest on the professional judgment of well-trained colleagues” (146). In an op-ed published last year (“Test Scores and Teacher Competency”), Popham wrote that tests for teacher evaluation are only appropriate if the teachers know what will be tested, and the tests measure what the teacher taught. “If either of these two requirements has not been satisfied, then the use of students’ test scores to evaluate teachers is unwarranted. Regrettably, at the moment, in almost all of our 50 states, neither of these requisite conditions has been satisfied” (emphasis added).
Can anyone out there rebut these experts head on, and overcome these obstacles to VAM-based teacher evaluation? So far, the answer is no. They fall back on what is statistically significant with current tests, though they can’t win the debate on what is educationally or intellectually significant about the tests. They say the tests will only be one of multiple measures, ignoring the long track record of failure when we assign any high stakes to such inadequate measures. They point to the fact that the practice is already in place, as if the bleeding of those patients should be acceptable because someone else has accepted it. I take exception to every one of those claims, and repeat, ad nauseam: you’re wrong. You’re very, very wrong. Stop the bleeding.