Statisticians: Value Limited for Value Added
Teacher evaluation has been a frequent topic in this space: Accomplished California Teachers (ACT) first coalesced as a teacher leadership group in large part to produce a report on evaluation that would feature teacher voice regarding current practices and promising reforms for California schools. I’ve also written frequently about an evaluation method that stands out as the worst popular idea out there – using value-added measurement (VAM) of student test scores as part of a teacher evaluation. The research evidence showing problems with VAM in teacher evaluation is solid, consistent, and comes from multiple fields and disciplines – most recently, statisticians (more on that in a moment). The evidence comes from companies, universities, and governmental studies. And the anecdotal evidence is rather damning as well: how many VAM train-wrecks do we need to see?
On the relevance of student learning to teacher evaluation, the ACT team that produced our evaluation report was influenced by the fact that many of us were National Board Certified Teachers. Our certification required evidence of student learning – after all, teaching without learning is merely a set of words or actions. Board certified or not, our team members all agreed that an effective teacher needs to be able to show student learning, as part of an analytical and reflective architecture of accomplished teaching. That doesn’t mean that student learning happens for every student on the same timeline, or shows up on the same types of assessments, but effective teachers take all assessments and learning experiences into account in the constant effort to plan and improve good instruction.
Value-added measures have a certain intuitive appeal, because they claim the ability to predict the trajectory of student test scores, theoretically showing the “value” added by the teacher if the score is higher than predicted. This deceptively simple concept sounds reasonable, especially for non-teachers, and even more so for policy makers. They often seem eager to impose on teachers and administrators what is essentially one-way “accountability” for the success of schools; stagnant or declining scores bring negative consequences, so the public can be reassured that insecure school personnel will be compelled to do their jobs. Meanwhile, policy makers often ignore (because the voters and media allow them to ignore) what should be their share of accountability for the conditions of schools, and even the outside-of-school conditions that all the experts agree outweigh teacher effects on standardized test scores. Yes, you read that correctly: most of the variation in students’ test scores can be accounted for by factors outside of school – factors like family wealth, educational attainment, health care, and similar.
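To make the "higher than predicted" idea concrete, here is a minimal sketch of the arithmetic. It is not any district's or vendor's actual model: the records are invented, the `value_added` function is hypothetical, and the prediction rule (every student is expected to gain the overall average) is deliberately naive. Real value-added models use regression with many more inputs.

```python
from statistics import mean

# Hypothetical records, invented for illustration only:
# (teacher, prior-year score, current-year score)
records = [
    ("A", 60, 66),
    ("A", 70, 74),
    ("B", 60, 62),
    ("B", 80, 82),
]

def value_added(records):
    """Average (actual - predicted) score per teacher.

    Assumption: the predicted score is the prior score plus the
    average gain across all students. This stands in for the far
    more elaborate prediction step of a real model.
    """
    avg_gain = mean(cur - prior for _, prior, cur in records)
    residuals = {}
    for teacher, prior, cur in records:
        predicted = prior + avg_gain
        residuals.setdefault(teacher, []).append(cur - predicted)
    # A teacher's "value added" is the mean of their students' residuals.
    return {teacher: mean(r) for teacher, r in residuals.items()}

scores = value_added(records)
```

Even in this toy version, whatever the prediction leaves out (family wealth, health care, tutoring) lands in the residual and is credited to or blamed on the teacher.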
If you care to look at some of my prior posts on the topic of VAM in teacher evaluation, you’ll find that education researchers, economists, scientists, mathematicians, and experts in psychometrics (the measurement of knowledge) have all weighed in against the idea. Some offer stronger objections than others, but most agree that VAM is not stable or reliable enough for high-stakes usage. It has also been noted by multiple professional associations that measures validated for one purpose (measuring student knowledge) cannot be assumed valid for other purposes (measuring teacher effect). The main proponents of VAM use for high-stakes personnel decisions all seem to be economists (Hanushek, Chetty, Kane), or researchers with some vested interest in finding what they end up finding (Gates Foundation, William Sanders).
Well, the latest professional group to weigh in on the topic was the American Statistical Association. The ASA is not against the concept or use of VAM, but they do caution that VAM should only be used under a whole set of circumstances that are quite unlike the circumstances found in schools and districts using VAM. For example, VAM should be used by experts, with clear information regarding formulas and margins of error, and careful analysis of how sensitive statistical models are when the assessment changes.
Here are some choice quotes from their April 8, 2014 report:
VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
Ranking teachers by their VAM scores can have unintended consequences that reduce quality.
VAMs are only as good as the data fed into them. Ideally, tests should fully measure student achievement with respect to the curriculum objectives and content standards adopted by the state, in both breadth and depth. In practice, no test meets this stringent standard, and it needs to be recognized that, at best, most VAMs predict only performance on the test and not necessarily long-range learning outcomes.
[Regarding studies that have found some predictive ability in VAM scores by teachers, with “correlations generally less than 0.5”]: These studies, however, have taken place in districts in which VAMs are used for low-stakes purposes. The models fit under these circumstances do not necessarily predict the relationship between VAM scores and student test score gains that would result if VAMs were implemented for high-stakes purposes such as awarding tenure, making salary decisions, or dismissing teachers.
I should also note that there’s a portion of this report I disagree with, regarding the potential use of VAM to evaluate teacher training programs:
A VAM score may provide teachers and administrators with information on their students’ performance and identify areas where improvement is needed, but it does not provide information on how to improve the teaching. The models, however, may be used to evaluate effects of policies or teacher training programs by comparing the average VAM scores of teachers from different programs. In these uses, the VAM scores partially adjust for the differing backgrounds of the students, and averaging the results over different teachers improves the stability of the estimates.
It’s unclear to me why VAM within schools or districts is described as measuring only correlation, yet when it is extended beyond schools to involve even more complex interactions of variables known and unknown, we’re suddenly talking about evaluating effects (causation, rather than mere correlation). While I understand the value of larger sample sizes in reaching stronger conclusions about data, I question the ability of anyone undertaking such an evaluation to control for the differences among schools. The quote above mentions only the “differing backgrounds of the students.” However, different teacher training programs develop different relationships with schools and districts. Teachers are not randomly distributed to schools or communities after their training, and the school’s and community’s effects on the teachers would seem highly relevant. There are studies that show the effects of principals on test scores, the effects of teachers on teachers, effects of class period length, effects of tutoring that may or may not be available, effects of libraries that may or may not even be open, etc. My open letter to California policy makers on this topic argued, and I would stand by the argument, that there are simply too many interacting variables to reach any reliable conclusions that depend on value-added measures.