Evaluating Teachers with VAM: Variable Ambiguous Mistake

August 29, 2010

tags: epi, policy, value added measures, vam

Anyone who has read this blog before is probably aware of my position on the use of value-added measurement for teacher evaluation. I have argued many times here, and in Teacher Magazine, that politicians, self-styled education reformers, and members of the general public are ill-informed if they believe that we can use state tests to determine teacher effectiveness. Accomplished California Teachers (ACT) addressed that issue in detail in our report on teacher evaluation, which also featured our recommendations for how California can improve teacher evaluations.

Imaginary VAM results for seven teachers. Image by the author.

This morning, having read Ken Bernstein’s Daily Kos post on the same topic, I have one more opportunity to address the issue, by looking at a new policy report from the Economic Policy Institute (EPI). The title is “Problems with the Use of Student Test Scores to Evaluate Teachers.” EPI convened ten experts* in the fields of teaching, learning, schools, testing, statistics, economics, and social policy, and their review of the available research yields a powerful consensus:

[T]here is broad agreement among statisticians, psychometricians, and economists that student test scores alone are not sufficiently reliable and valid indicators of teacher effectiveness to be used in high-stakes personnel decisions, even when the most sophisticated statistical applications such as value-added modeling are employed.

Of course, many of the advocates of VAM in teacher evaluations are particularly interested in firing the bad teachers. They may talk about helping identify the best teachers and helping all teachers improve, but you don’t have to be more than a casual observer of these debates to have noticed how they relish the prospect of getting tough on teachers. The authors of this report have a response to the notion that VAM will help schools clean house and produce better results:

If new laws or policies specifically require that teachers be fired if their students’ test scores do not rise by a certain amount, then more teachers might well be terminated than is now the case. But there is not strong evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones. There is also little or no evidence for the claim that teachers will be more motivated to improve student learning if teachers are evaluated or monetarily rewarded for student test score gains.

Everyone who cares about schools and students should be shouting down the proponents of VAM for teacher evaluation until they produce evidence to counter those demonstrating all of the problems with that approach. For too long, the politicians and the tough-on-teachers, tough-on-unions education reformers have been able to coast on their sound bites. They tell us we “support the status-quo;” they tell us “it’s about the students’ needs, not the adults.” They rely on the simplistically appealing but incorrect notion that, of course, student test scores measure teaching effectiveness – a notion disproven on several levels now. They tell us that schools need to run more like businesses. The EPI report authors respond:

A second reason to be wary of evaluating teachers by their students’ test scores is that so much of the promotion of such approaches is based on a faulty analogy—the notion that this is how the private sector evaluates professional employees. In truth, although payment for professional employees in the private sector is sometimes related to various aspects of their performance, the measurement of this performance almost never depends on narrow quantitative measures analogous to test scores in education.

There are many reasons that VAM fails, most of which I’ve touched upon before, and anyone wanting a more detailed overview can check Ken’s post, or read the report. I just want to call attention to one of the more interesting ones. It turns out that if you analyze the data backwards, VAM can appear to prove that next year’s teacher raised this year’s test scores. Now, of course that can’t be true. If VAM were valid, you would expect it to isolate the effects of teaching that has actually occurred; pointed at teaching that hasn’t yet occurred, the data should turn “fuzzy” – you wouldn’t observe any results looking at the data that way. If the data can be turned upside down and still appear to show that “future effect,” then that tells us students are not randomly placed with teachers. The deck is stacked in ways that will bias the results for or against a given teacher.

I had some success with a baseball analogy last week, so I’ll try another. If you wanted to measure the effectiveness of basketball coaches, wouldn’t you need to randomly distribute the players, and account for the variable quality of the rest of the staff, and the facilities? Of course you would. So, Phil Jackson should have as good a chance of guiding the Los Angeles Clippers to a championship as he did the Los Angeles Lakers.

Now, if we were to run win-loss records through the VAM data analysis in the same backwards fashion, consider the results. Do you think next year’s lineup would appear to affect this year’s record? Of course it would, and the reason is obvious: in most cases, there will be considerable overlap. You don’t start each year with an entirely new roster. However, in schools, you do start with a new roster. The report has this to say about the phantom future effect:

Inasmuch as a student’s later fifth grade teacher cannot possibly have influenced that student’s fourth grade performance, this curious result can only mean that students are systematically grouped into fifth grade classrooms based on their fourth grade performance. For example, students who do well in fourth grade may tend to be assigned to one fifth grade teacher while those who do poorly are assigned to another. The usefulness of value-added modeling requires the assumption that teachers whose performance is being compared have classrooms with students of similar ability (or that the analyst has been able to control statistically for all the relevant characteristics of students that differ across classrooms).
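
To make the logic of that “future effect” concrete, here is a minimal sketch in Python. It is not the model used in the report or in any real VAM system; the score distribution, class sizes, and the function name classroom_gap are all assumptions made up for illustration. It generates fourth grade scores, assigns students to fifth grade classrooms either at random or by prior score, and then compares the classroom averages of those fourth grade scores. No fifth grade teaching happens anywhere in the simulation, so any gap between classrooms comes from how students were grouped.

```python
import random
import statistics


def classroom_gap(n_students=200, n_teachers=4, sorted_assignment=False, seed=0):
    """Gap between the highest and lowest classroom mean of FOURTH grade
    scores, grouped by each student's future fifth grade teacher.

    All numbers here are hypothetical; no teacher has any effect at all.
    """
    rng = random.Random(seed)
    # Fourth grade scores exist before any fifth grade teaching happens.
    scores = [rng.gauss(50, 10) for _ in range(n_students)]
    if sorted_assignment:
        # Tracking: students are grouped into classrooms by prior score.
        ordered = sorted(scores)
    else:
        # Random assignment: shuffle before splitting into classrooms.
        ordered = scores[:]
        rng.shuffle(ordered)
    size = n_students // n_teachers
    means = [
        statistics.mean(ordered[i * size:(i + 1) * size])
        for i in range(n_teachers)
    ]
    return max(means) - min(means)


random_gap = classroom_gap(sorted_assignment=False)
tracked_gap = classroom_gap(sorted_assignment=True)
print(f"random assignment gap:  {random_gap:.1f} points")
print(f"tracked assignment gap: {tracked_gap:.1f} points")
```

Under random assignment the classrooms look nearly alike on last year’s scores, while under tracking the “future” teachers appear to differ a great deal, even though none of them has taught these students yet. That is the pattern the report reads as evidence of non-random grouping.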

On this point, it appears to me the EPI report authors could go even further in taking apart VAM. Consider the impossibility of ever identifying and controlling for “all” of the factors that could affect the data, especially when dealing with samples as small as one class. (Larger studies can claim to mute the effects of such variables by working with large samples.) Or, lower the bar from “all” to “enough” – and then tell me how you find “enough” without knowing how many there are in the first place. And in actuality, the proponents of VAM would need to account not only for the varying characteristics of students, but also the varying effects of combinations of students, the varying effects of classrooms themselves, and the varying effects of every relevant factor in the school that could affect the teacher, students, or classroom.

Good luck.

Meanwhile, more states are winning Race to the Top grants, and celebrating the opportunity to waste money on this misguided approach, as they ignore more cost-effective and proven ways to improve schools and support teachers and students. We lead the industrialized world in child poverty and poor health care, but by all means, let’s pour hundreds of millions of dollars into voodoo methods to pick out the bad teachers and reward the good ones. The report authors conclude their executive summary with this sobering and entirely realistic assessment of the consequences if we continue down this path:

Adopting an invalid teacher evaluation system and tying it to rewards and sanctions is likely to lead to inaccurate personnel decisions and to demoralize teachers, causing talented teachers to avoid high-needs students and schools, or to leave the profession entirely, and discouraging potentially effective teachers from entering it. Legislatures should not mandate a test-based approach to teacher evaluation that is unproven and likely to harm not only teachers, but also the children they instruct.

*Disclosure: note that one of the EPI report authors is Linda Darling-Hammond, whose work at Stanford includes advising Accomplished California Teachers, the group responsible for this blog.

30 Comments

Peter D. Ford III permalink

August 29, 2010 7:04 pm

VAM is data; data is always, and primarily, a window through which you may seek the truth. Complaining that ‘standardized tests don’t measure learning’ does not deny the data VAM gives you. Until we change the tests, they’re all we have. While there are many other qualities that comprise an effective teacher, they are very, very difficult to quantify, if not impossible. VAM is a quantifiable measurement that if used properly can help a teacher.

The core issue is indifferent, cowardly leadership that’s failed to support teachers in their profession, allowing this disease to fester until it explodes in a bloody, infected mess. Reading the LA Times article was disheartening; the feckless attitudes of administrators who failed to even recognize quality teachers are ‘prima facie’ evidence of woeful LAUSD leadership ruining teachers and children ultimately.

- David B. Cohen permalink*
  
  August 29, 2010 8:29 pm
  
  Thanks for reading and commenting, Peter. I share your opinion that ultimately, the problem is what we do with the data, or what is done to us with the data by unwise policies. However, to go deeper into your comment, I guess it would be necessary to start talking about the details of particular tests. The ones I’m most familiar with are the once-a-year reading/language arts tests taken by 9th and 10th graders, and they are incredibly mediocre exams – far from being a useful proxy for the skills I’m supposed to teach. First of all, none of my state standards say “students will be able to select correct answers to other people’s questions about random reading passages and fabricated documents.” The multiple choice questions they answer are often ambiguous and poorly worded, and they often do not require students to read or understand the passage in order to identify the correct answer. For example, if you asked me (in English), “In which section of this Polish document will you find information about visiting Warsaw?” – I only need to be able to spot the word “Warsaw.” I don’t understand a word of Polish, but if I understand the question, I can answer it. Another issue is that the tests ask a very small number of questions regarding any given skill or standard. If they only ask 8 questions about literary response, I could get lucky and have 8 that I can handle, or get unlucky and miss 3 of them. Another 8-question sample could swing the results considerably. One study found that for that reason, subtests (sections) on state standardized tests are useless for diagnostic/instructional purposes (Cizek 2007). Now just to be clear, I have no problem with assessment or high standards.
  I have great confidence that I’m an effective instructor, but it comes from the evidence of student growth in every standards area that I teach – not just some incredibly weak questions that do a very poor job of attempting to simulate what my students are supposed to learn in a small subset of those standards.
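
To put rough numbers on the eight-question example, here is a minimal sketch in Python. It rests on assumptions that are not in the comment above: that each question is answered independently, and that a hypothetical student has a 75% chance of getting any one question right. Real test items are not independent and the true rate is unknown, so treat this only as an illustration of how much an 8-item score can swing by chance.

```python
from math import comb


def prob_correct(k, n=8, p=0.75):
    """Probability of exactly k correct answers out of n independent
    questions, each answered correctly with probability p (assumed)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Full distribution of scores on a hypothetical 8-question subtest.
dist = {k: prob_correct(k) for k in range(9)}

p_all = dist[8]                                   # "lucky": all 8 correct
p_five_or_fewer = sum(dist[k] for k in range(6))  # "unlucky": 3 or more missed

print(f"P(8 of 8 correct):     {p_all:.2f}")
print(f"P(5 or fewer correct): {p_five_or_fewer:.2f}")
```

The same student, with the same underlying skill, can plausibly land on either a perfect score or a mediocre one, which is the instability the comment describes for short subtests.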
  
  - Peter D. Ford III permalink
    
    September 4, 2010 3:42 pm
    
    I feel sorry for Language Arts/History folks; with mathematics it’s much easier for a standardized test to reflect student knowledge and skills. It’s not perfect, but solving a quadratic equation is solving a quadratic equation, and hasn’t changed much in several hundred years. My view on tests is colored obviously by this perspective, which again proves how complex is our craft of education.
mark permalink

August 29, 2010 9:17 pm

I still contend: the only way to effectively measure teacher quality is to thoroughly prepare administrators, instructional coaches and teacher leaders to do frequent (and deep) in-class observations. If you are going to assess someone at their job, why not walk in and watch them work? Oh yeah, that costs piles and piles of money.

By the data, perhaps my students underperform by comparison to teachers down the hall or up the freeway. Then why do I keep getting these awards and having teachers fly in from all over the region to watch me and my teaching-partner do our jobs? Teaching is (as you point out) so complex that any effort to reduce it to singular data which will assess, predict, and identify is beyond logic. There are simply too many moving parts in the system.

- Peter D. Ford III permalink
  
  September 4, 2010 3:43 pm
  
  Mark, we can go one step better: put a video camera in every room, so that every teacher can review their own performance as an athletic coach reviews game tape.
  
teacherken permalink

August 30, 2010 6:56 pm

First, I thank David for referring to my post at Daily Kos.

Second, I am neither a psychometrician nor a statistician, although I got some training in both fields as a part of a never finished doctorate in educational policy.

Like Multiple Regression, most value-added approaches attempt to control for various factors. Yet if I remember my MR, the greater the number of factors on which I regress, the larger the error factor, no? So if I am trying to control for more factors on the same amount of data, then it should make sense that I am likely to have a wider margin of error (apologies for not using technical terminology – where I am it is almost 10 PM and I was up at 5 this morning).

Of course the larger the number of cases being considered the narrower the problem.

Given all that, what I found most interesting was the study by Mathematica, which says if you attempt to use VAM to classify teachers as being away from the great middle by being superior or really towards the bottom, with 2 years of data you had a 36% rate of misclassification (hello, LA Times), with 3 years 26%, and even with 10 years 12%. Sounds to me like it ain’t all that reliable.

Add to that the recognized instability of VAM results for individual teachers from year to year, and it sounds like a real reliability problem. Absent reliability, how then is one in a position to really know the inferences one draws are valid?

Gee, I wish those who bloviate about things like test scores had a real understanding of psychometrics, and that those who want to tout certain results a la the LA Times had some understanding of what the data upon which they are relying represents.

Perhaps we should value reporters by a measure with a 36% error rate in classification and see how they like it?

Mike permalink

September 15, 2010 11:26 am

David,
I always appreciate your take, and by and large I agree with most of your post. But I must admit, as an economics teacher, you lost me here:

“There is also little or no evidence for the claim that teachers will be more motivated to improve student learning if teachers are evaluated or monetarily rewarded for student test score gains.”

Now, you might merely be stating a synopsis of the “research” but I hope you don’t mean to insinuate that teachers will not respond to incentives. The tenure based system employed by most districts is fatally lacking in this one area.

Are you against incentives in principle? In my cursory glance, I do not see mention of this in your recommendations for changes to evaluation systems in CA.

MM

- David B. Cohen permalink*
  
  September 15, 2010 2:24 pm
  
  Hello Mike,
  I always appreciate being pushed to refine my thinking and writing, but this particular quotation comes from the report I’m citing. It’s not originally my statement.
  
  However, I do agree with the statement. The key detail is the distinction between learning and test scores. So teachers might very well respond to test score-based incentives, but if the focus is on raising test scores, I have misgivings about the likelihood of improvement related to desirable learning outcomes. It’s the narrowing effect that we’ve seen with NCLB, and a dose of Campbell’s Law, which as you probably know, argues that high stakes decisions based on narrow measures will inevitably corrupt the measurement tool and the value of what’s being measured.
  
  It’s also worth looking into the work of Daniel Pink, who presents a convincing case in Drive that incentives generally work better for simple tasks that we might not otherwise be motivated to do. Complex, creative tasks, and those we generally enjoy doing, tend to suffer when incentives come into play. Autonomy, mastery, and purpose will drive intrinsic motivation, which is generally more powerful and sustained than external motivation guided by rewards and consequences.
  
  - Mike permalink
    
    September 15, 2010 8:48 pm
    
    Hi David,
    I am not familiar with Campbell. However, I don’t necessarily mean to advocate the use of high stakes testing as the sole measurement tool by which we might compose an argument about teacher quality. Forgive me for speaking more generally in the context of your post. What I mean to say is that the lack of performance based incentives, regardless of evaluation methodology, serves to breed a culture that I believe is detrimental to the profession. For example, in my district, principals are unable to assign a reading to teachers on best practice without being accused of union busting. With all due respect to unions, that is a problem. I read similar accounts every day on my association’s message forum.
    
    If my school system is like most, the only way for a teacher to make additional money is to do something other than teaching. I am intrinsically motivated to create high quality lessons, but my motivation has limits. I must consider, like many other teachers, the opportunity cost associated with staying late at work, or turning around 120 essays in 24 hours, despite what I may know is best for my students. I’d rather write some curriculum for 30 bucks an hour or coach a baseball team for 10 bucks an hour, or play in the yard with my son. At least I can justify these because they benefit my family.
    
    We must reward high quality teaching with money. There must be a reason for me to go visit another teacher’s classroom. There must be a reason for me to teach in a way that makes me feel uncomfortable. There must be a reason for teachers to get to work on time, attend meetings, be open to suggestions, and otherwise do their job. Many of these tasks are simple. Heck, start there.
    
    As a profession we are willing to put time and resources into evaluating and helping the least effective teachers (my district uses a Peer Assistance and Review program). But we are not willing to provide carrots for the other 95% of teachers for fear those carrots will be distributed unfairly. However, a more unfair system is to pay the senior teacher who continues to assign questions from a book the same or more than the young teacher who delivers dynamic differentiated instruction. Think about the culture that creates, and tell me you think it doesn’t exist, or that it is not worthy of best effort to change it. Incentives matter.
  - David B. Cohen permalink*
    
    September 15, 2010 10:28 pm
    
    Mike, thanks so much for taking the time to reply in such detail. One of the best things about blogging is having your views challenged or probed, and given that my views are somewhat limited by experience and context (like anyone else), it helps to read what you wrote. On some of the specific points, I certainly agree that an administrator should be able to assign reading without being accused of union-busting. I’ve also seen some examples of hyper-sensitivity around labor issues that are unbecoming of a professional association, which is what I think we should be. Professionals do not watch every minute on the clock, and they should expect to be engaged in ongoing development and dialogue with each other and with administrators. However, I think that a wise administrator would not “assign” reading out of the blue, but would perhaps share readings that address the school’s needs, which would be a constant topic of dialogue. I would hope that a principal either said, “I went to find resources that address our needs, and I’d like you to read this for our next meeting,” or perhaps, “Here’s an article that some teachers shared with me, and I thought it worth sharing with the whole staff.”
    
    On the larger issue of incentives, I agree with you that incentives are important, but would prioritize other incentives. Again, Daniel Pink has written what I consider the bible on motivation, Drive. Now, thinking of yourself or some other teachers, do you think you might be motivated to do your best work if you had greater autonomy to make educational decisions for your students, guided not by distant bureaucracies but rather by your own judgment, exercised within the context of a professional, collaborative community? Would your motivation increase if you were supported in efforts to become a master teacher, perhaps by designing your own professional development program to suit your needs? Would your motivation increase if you saw a greater sense of purpose in your work, perhaps by linking it more effectively with other classes, schools, and the broader community? Pink’s book argues convincingly that such priorities would yield higher quality, longer lasting results when compared with pay incentives. Now, I do agree with you that there should be some way for a highly qualified, highly effective mid-career teacher to “leap-frog” the lackadaisical veteran in pay – but I would attach that pay to differentiated roles and responsibilities that also provide greater autonomy, mastery, and purpose.
Mike permalink

September 16, 2010 8:28 pm

Thanks for the reply, David. I’ll add Pink to my reading list. I must admit that I already feel a great deal of autonomy with regard to lesson creation and decision making. However, I know others who teach the same material in my same school and feel much differently. I suppose perception is reality, which creates quite a pickle if you’re Pink. Any constraint could then be perceived as an attack on autonomy, mastery, and purpose. I tend to think those people will be miserable no matter what the circumstances. But alas, maybe we can continue this conversation after I read.

- David B. Cohen permalink*
  
  September 18, 2010 3:36 pm
  
  “quite a pickle if you’re Pink” – too funny! But regarding the more serious point beneath your wordplay, yes, perceptions are reality in these matters, to a large extent. Those of us in schools would be well-advised to start off with as much information as possible so that our perceptions are informed by many perspectives. We shouldn’t overlook the degree of autonomy that we have. Then, to the extent that some constraints are necessary and desirable, we should have opportunities to shape our own constraints. I accept the idea that I can’t just teach whatever and however I want if my class is going to fit into a cohesive program for my department and school. I’m much more likely to accept those constraints, and they’re much more likely to be followed and get results, if my autonomy is respected by allowing me a serious and substantive opportunity to influence and shape those constraints. There is also some dynamic tension among those conditions – if you are teaching with a sense of purpose, it should seem inevitable that your autonomy and mastery must serve to develop your students’ mastery, and that’s going to require some coordination with others, an acceptance of reasonable limits on your autonomy.
  
  - Mike permalink
    
    September 19, 2010 7:27 pm
    
    I’m mmccabe over on the Washington Post (full disclosure!).
Mike permalink

October 18, 2010 5:48 pm

Pink is thought provoking so far. I’m convinced on one thing: for many the current tenure system is not based on “fairness” – a working presumption for Pink. Many are working hard… but some are not. This has a detrimental effect on those who work hardest.

Reading this: http://www.mitul.org/sites/default/files/ITUL%20BAES%20CaseStudy(15)%203-1-10_1.pdf and wondering about what you think. It’s about a school in our district that was “taken over” but essentially done so by the staff. It all seems a little too “union congratulatory” in the beginning, but then when it gets to the heart of what happened it’s very interesting. A lot of autonomy and mastery… but I think of note is that 1/3rd of the teachers opted out of the extra hours needed to turn around the school in question. Call me a skeptic, but I really wonder which 1/3rd opted out, and I wonder to what extent this could have happened if they had not opted out. Needless to say, it has some interesting parallels to Pink’s understanding of motivation.

In particular, I’m intrigued by the empowerment of the teachers to make decisions. And I love the part where teachers are doing walk throughs in lieu of administrators. However, I’d also like to see the pay, rather than attached to seniority, be attached to teacher effectiveness, or AT LEAST in combination with teacher seniority or teacher responsibility.

- David B. Cohen permalink*
  
  October 20, 2010 2:19 pm
  
  Hi Mike – thanks for sharing the link re: BAES. I see what you mean about the tone, but it is certainly an instructive case study. On the concept of differentiating pay for teachers, I prefer a slight variation, but one that still moves away from the current model. Instead of paying more effective teachers more just for continuing to be effective teachers, I’d suggest paying them more for taking on different responsibilities. And just to be clear, the measure of effective teaching would be a robust, complex evaluation. Then, if you’re among the most effective, you might be expected to mentor other teachers, participate in curricular or instructional leadership, or take on some hybrid teaching-administrative role. Or, perhaps a small bump for achieving that “highly effective” status, but slightly more for taking on other responsibilities drawing on the teacher’s skills.
  
Monica McNeil permalink

August 29, 2013 8:35 pm

I enjoyed reading your post. You touch on so many points as to why this evaluation system does not work. I taught a high percentage of functionally illiterate sixth graders last year and have a binder full of letters & artwork saying how much I helped and inspired them. I was given the “bottom of the barrel” so to speak since I was the “newbie” while the veteran teacher that taught the same subject was given the “cream of the crop”. I rose to the challenge of teaching these high risk kids & I know in my heart & soul I made a positive difference in my students’ lives. Unfortunately, since these students came to me “low” to begin with, their test scores were horrible, hence making my VAM scores low. I’m now deciding whether to leave the teaching profession, which I’m completely dedicated to, because I was not rated as effective with this system. It is very upsetting to say the least.

- David B. Cohen permalink*
  
  August 29, 2013 11:30 pm
  
  Glad you found this post, and found it resonated with you. I’m sorry to hear you’re contemplating a premature departure from teaching over this, but sadly, I think you’d have company. This fraud of education reform is disheartening to many teachers around the country, and it seems as often as not the scores don’t matter. That is, teachers with “good” VAM scores don’t necessarily put any more stock in the approach even though they’re supposedly stronger teachers. It’s the overall atmosphere of distrust and excessive control that undermines schools and teachers. Good luck to you whatever happens.
  

A group blog from Accomplished California Teachers: Classroom expertise for better education policy.
