Big Apple’s Rotten Ratings
If you’ve been following the news in education or in New York City recently, you’ve no doubt heard about the city releasing its “value-added” calculations of test data, putatively showing – with high degrees of error – the effectiveness of teachers. I’ve made it something of a crusade to challenge the validity of this approach to teacher evaluation, and I won’t rehash the whole set of arguments here and now (though I did provide links to some of my prior posts, below). If you want to look more deeply into the current events in New York City, you can find a good list of responses provided by – who else? – Larry Ferlazzo.
So far, I think the best image from the whole fiasco comes from math teacher Gary Rubinstein, who ran the numbers himself, a bunch of different ways. His first analysis works on the premise that a teacher should not become dramatically better or worse in one year. He compared the data for 13,000 teachers over two consecutive years and plotted the result: a virtually random distribution.
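For readers who want to see what a "virtually random" year-to-year comparison means in numbers, here is a minimal sketch in Python. It is not Rubinstein's data or his method. It simulates 13,000 hypothetical teachers, each with a stable "true" effect, and adds measurement noise to each year's rating. The function names, the noise levels, and the assumption of normally distributed effects are all mine, chosen only for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_two_years(n_teachers, noise_sd, seed=0):
    """Return (year1, year2) ratings for simulated teachers.

    Assumption (hypothetical): each teacher has a fixed true effect drawn
    from a standard normal distribution, and each year's rating equals
    that effect plus independent noise with standard deviation noise_sd.
    """
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    year1 = [t + rng.gauss(0, noise_sd) for t in true_effect]
    year2 = [t + rng.gauss(0, noise_sd) for t in true_effect]
    return year1, year2


# Same simulated teachers, two made-up noise levels.
low_noise_r = pearson(*simulate_two_years(13000, noise_sd=0.3))
high_noise_r = pearson(*simulate_two_years(13000, noise_sd=3.0))

print(f"year-to-year correlation, little noise: {low_noise_r:.2f}")
print(f"year-to-year correlation, heavy noise:  {high_noise_r:.2f}")
```

In this toy setup no teacher changes at all between years, yet when the noise is large the two years of ratings barely track each other. That is the pattern the premise above is meant to detect: if ratings for the same teachers look nearly unrelated from one year to the next, the ratings are reflecting mostly noise rather than a stable quality of the teacher.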
When it comes to close examination of issues involving research and statistics, another source I read frequently is The Shanker Blog, where Matthew DiCarlo consistently provides a clear analysis of data and methodologies accessible to those of us who are less versed in such matters. His deconstruction of the value-added ratings in New York City – “Reign Of Error: The Publication Of Teacher Data Reports In New York City” – was, as expected, penetrating and clear. He does not simply rail against the whole enterprise, though he does state that, overall, “a lot of these ratings present misleading information,” and, “Although many [evaluation] systems are still on the drawing board, most of the new evaluations now being rolled out make the same [value-added calculation] ‘mistakes’ as are reflected in the newspaper databases.”
However, I find myself reading DiCarlo with a rare sense of dissatisfaction. Without disagreeing with his analysis, I can’t help feeling that such a technical debate is actually legitimizing an illegitimate approach. First of all, as I’ve repeated every chance I get, the three leading professional organizations for educational research and measurement (AERA, NCME, APA) agree that you cannot draw valid inferences about teaching from a test that was designed and validated to measure learning; they are not the same thing. No one using value-added measurement EVER has an answer for that.
Then, I thought of a set of objections that had already been articulated on DiCarlo’s blog by a commenter. Harris Zwerling called for answers to the following questions if we’re to believe in value-added ratings:
1. Does the VAM used to calculate the results plausibly meet its required assumptions? Did the contractor test this? (See Harris, Sass, and Semykina, “Value-Added Models and the Measurement of Teacher Productivity” Calder Working Paper No. 54.)
2. Was the VAM properly specified? (e.g., Did the VAM control for summer learning, tutoring, test for various interactions, e.g., between class size and behavioral disabilities?)
3. What specification tests were performed? How did they affect the categorization of teachers as effective or ineffective?
4. How was missing data handled?
5. How did the contractors handle team teaching or other forms of joint teaching for the purposes of attributing the test score results?
6. Did they use appropriate statistical methods to analyze the test scores? (For example, did the VAM provider use regression techniques if the math and reading tests were not plausibly scored at an interval level?)
7. When referring back to the original tests, particularly ELA, does the range of teacher effects detected cover an educationally meaningful range of test performance?
8. To what degree would the test results differ if different outcome tests were used?
9. Did the VAM provider test for sorting bias?
I then added the following comment on the blog post, and have copied it below (with a couple of added links) to finish off my post.
Overall, I think you’ve done us a great service by digging more deeply into the labels and the meanings of different estimates and calculations. I also think everything Harris Zwerling noted above (in his “incomplete” list) presents insurmountable problems for the use of VAM to evaluate teachers given the available data.
Also, if we change some significant variables in the context and conditions surrounding the teacher’s work from year to year, then haven’t the relevant data been produced in different ways, rendering them much more difficult, if not impossible, to compare, or to blend into a single calculation? And do you know any teachers for whom conditions are held constant from one year to the next? If our goal is to compare teachers, can we assume that changes in conditions each year affect each teacher in the same way?
And finally, a stylistic request: a few times above you used the data and labels to say that “the teacher is” average, above, below, etc. I don’t think you mean to suggest that these tests actually tell us what the teacher’s overall quality or skill might be. “An average teacher” in this situation is more precisely “a teacher whose students produced average test score gains.” I think it’s important not to let our language and diction drift, lest our thoughts become similarly muddled on this topic.
Here are some other posts on the topic, by New York City educators (other than Gary Rubinstein, linked above):
Another teacher, Arthur Goldstein: No Way Out of the Evaluation Trap
Carol Corbett Burris, a Long Island principal: A simple question teachers should now ask about their profession
Here’s what I wrote about the situation when the L.A. Times made a similar mistake:
And a few other pieces that rip value-added measures used for teacher evaluation: