Friday, January 25, 2013

Gates MET Study Produces $45 Million Worth of Questionable Methodology and Conclusions

Testing expert Skip Kifer shot me a note recently, after I had posted Walt Gardner's doubts about the quality of some educational research, which I had titled BS in Science.
If you want to see a real mess, go to the reports of MET, the Gates stuff on teacher effectiveness. The good lord only knows what they did statistically and why they did it. They have all of the fancy Greek symbols and statistical blather. The paper I perused fit your critique well; why should I believe what they conclude if they cannot tell me what exactly they have done and for what reasons? Of course, all of the reporting of the study has concluded that Gates has solved forever (meaning one year's worth) all questions associated with teacher effectiveness.
I finally got around to taking a peek. Being a "qualitative guy," I'll defer to Skip and the "quants" on the particulars, but here's some of what I found.

This from The Great Lakes Center for Education Research and Practice:

Gates Foundation report on teacher evaluations seriously flawed, 
leading economist finds 

'Measures of Effective Teaching' report is based on 
flawed research and predetermined conclusions, review shows
A report on teacher evaluations recently released by the Bill and Melinda Gates Foundation has been refuted by one of the nation's leading economists, who found the widely published report to be seriously flawed.

The Gates Foundation last month released the first report of its "Measures of Effective Teaching" (MET) project, which aims to develop a reliable method for evaluating teachers. The report was thoroughly reviewed for the Think Twice think tank review project by University of California at Berkeley economist Jesse Rothstein, former chief economist at the U.S. Department of Labor.

Rothstein, who is also former senior economist for the Council of Economic Advisers, found the Gates Foundation's MET report to be based on flawed research and predetermined conclusions.
The review was produced by the National Education Policy Center (NEPC), housed at the University of Colorado at Boulder School of Education, with funding from the Great Lakes Center for Education Research and Practice.

Rothstein's analysis found the MET report draws conclusions that are not supported by its own facts, with some data in the report pointing "in the opposite direction" from what is indicated in its "poorly-supported conclusions."

Rothstein found several instances of conclusions not supported by data. One striking example: The MET report's data suggest that many teachers whose students have low math scores rank among the best at teaching "deeper" concepts. Yet the MET report draws the conclusion that teachers whose students score highly on standardized math tests "tend to promote deeper conceptual understanding as well."

Rothstein also found that the MET report relies heavily on standardized test scores and student surveys, which are insufficient measurements of teacher effectiveness, as teachers facing high-stakes testing will emphasize skills and topics geared toward raising test scores, while de-emphasizing those that aren't on the test. High-stakes student surveys, meanwhile, can be distorted by mischievous adolescents who may not answer honestly if they know their responses can affect teachers' compensation and careers, while teachers may be compelled to alter their practice to cater to student demands, Rothstein reported.

Then there's this from School Finance 101, by way of the National Education Policy Center:

Gates Still Doesn’t Get It! 

Trapped in a World of Circular Reasoning & Flawed Frameworks

Not much time for a thorough review of the most recent release of the Gates MET project, but here are my first-cut comments on the major problems with the report. The take-home argument of the report seems to be that their proposed teacher evaluation models are sufficiently reliable for prime-time use and that the preferred model should put about 33 to 50% of its weight on test-score-based statistical modeling of teacher effectiveness, coupled with at least two observations of every teacher. They come to this conclusion by analyzing data on 3,000 or so teachers across multiple cities. They arrive at the 33 to 50% figure, coupled with two observations, by playing a tradeoff game. They find – as one might expect – that a teacher's prior value added is still the best predictor of itself a year later… but that when the weight on observations is increased, the year-to-year correlation for the overall rating increases (well, sort of). They still find relatively low correlations between value-added ratings for teachers on state tests and ratings for the same teachers with the same kids on higher-order tests.
So, what’s wrong with all of this? Here’s my quick run-down:

1. Self-validating Circular Reasoning
I’ve written several previous posts explaining the absurdity of the general framework of this research, which assumes that the “true indicator of teacher effectiveness” is the following year's value-added score. That is, the validity of all other indicators of teacher effectiveness is measured by their correlation with the following year's value added (as well as value added estimated against alternative tests – with less emphasis on this). Thus, the researchers find – to no freakin’ surprise – that prior-year value added is, among all measures, the best predictor of itself a year later. Wow – that’s a revelation!
As a result, any weighting scheme must include a healthy dose of value-added. But, because their “strongest predictor of itself” analysis put too much weight on VAM to be politically palatable, they decided to balance the weighting by considering year-to-year reliability (regardless of validity).
The hypocrisy of their circular validity test is best revealed in this quote from the study:
Teaching is too complex for any single measure of performance to capture it accurately.
But apparently the validity of any and all other measures can be assessed by their correlation with a single measure (VAM itself)!?????
See also:
Evaluating Evaluation Systems
Weak Arguments for Using Weak Indicators

2. Assuming Data Models Used in Practice are of Comparable Quality/Usefulness
I would go so far as to say that it is reckless to assert that the new Gates findings on this relatively select sub-sample of teachers (for whom high-quality data were available on all measures over multiple years) have much, if any, implication for the usefulness of the types of measures and models being implemented across states and districts.

I have discussed the reliability and bias issues in New York City’s relatively rich value-added model on several previous occasions. The NYC model (likely among the “better” VAMs) produces results that are sufficiently noisy from year to year to raise serious questions about their usefulness.

Certainly, one should not be making high-stakes decisions based heavily on the results of that model. Further, averaging over multiple years means, in many cases, averaging scores that jump from the 30th to the 70th percentile and back again. In such cases, averaging doesn't clarify, it masks. But what the averaging may be masking is largely noise. Averaging noise is unlikely to reveal a true signal!
Further, as I’ve discussed several times on this blog, many states and districts are implementing methods far more limited than a “high quality” VAM, and in some cases states are adopting growth models that don’t attempt – or only marginally attempt – to account for any other factors that may affect student achievement over time. Even when those models do make some attempts to account for differences in students served, in many cases, as in the recent technical report on the model recommended for use in New York State, those models fail! And they fail miserably. But despite the fact that those models fail so miserably at their central, narrowly specified task (parsing teacher influence on test score gain), policymakers continue to push for their use in making high-stakes personnel decisions.

The new Gates findings – while not explicitly endorsing use of “bad” models – arguably embolden this arrogant, wrongheaded behavior!  The report has a responsibility to be clearer as to what constitutes a better and more appropriate model versus what constitutes an entirely inappropriate one.
See also:
Reliability of NYC Value-added
On the stability of being Irreplaceable (NYC data)
Seeking Practical uses of the NYC VAM data
Comments on the NY State Model
If it’s not valid, reliability doesn’t matter so much (SGP & VAM)

3. Continued Preference for the Weighted Components Model
Finally, my biggest issue is that this report and others continue to think about this all wrong. Yes, the information might be useful, but not if forced into a decision matrix or weighting system that requires the data to be used/interpreted with a level of precision or accuracy that simply isn’t there – or worse – where we can’t know if it is. (emphasis added)

Allow me to copy and paste one more time the conclusion section of an article I have coming out in late January:
As we have explained herein, value-added measures have severe limitations when attempting even to answer the narrow question of the extent to which a given teacher influences tested student outcomes. Those limitations are sufficiently severe that it would be foolish to impose rigid, overly precise, high-stakes decision frameworks on these measures. One simply cannot parse point estimates to place teachers into one category versus another, and one cannot necessarily assume that any one individual teacher’s estimate is valid (non-biased). Further, we have explained how student growth percentile measures being adopted by states for use in teacher evaluation are, on their face, invalid for this particular purpose. Overly prescriptive, overly rigid teacher evaluation mandates, in our view, are likely to open the floodgates to new litigation over teacher due process rights, despite much of the policy impetus behind these new systems supposedly being the reduction of legal hassles involved in terminating ineffective teachers.
This is not to suggest that any and all forms of student assessment data should be considered moot in thoughtful management decision making by school leaders and leadership teams. Rather, that incorrect, inappropriate use of this information is simply wrong – ethically and legally (a lower standard) wrong. We accept the proposition that assessments of student knowledge and skills can provide useful insights both regarding what students know and potentially regarding what they have learned while attending a particular school or class. We are increasingly skeptical regarding the ability of value-added statistical models to parse any specific teacher’s effect on those outcomes. Further, the relative weight in management decision-making placed on any one measure depends on the quality of that measure and likely fluctuates over time and across settings. That is, in some cases, with some teachers and in some years, assessment data may provide leaders and/or peers with more useful insights.  In other cases, it may be quite obvious to informed professionals that the signal provided by the data is simply wrong – not a valid representation of the teacher’s effectiveness.
Arguably, a more reasonable and efficient use of these quantifiable metrics in human resource management might be to use them as a knowingly noisy pre-screening tool to identify where problems might exist across hundreds of classrooms in a large district. Value-added estimates might serve as a first step toward planning which classrooms to observe more frequently. Under such a model, when observations are completed, one might decide that the initial signal provided by the value-added estimate was simply wrong. One might also find that it produced useful insights regarding a teacher’s (or group of teachers’) effectiveness at helping students develop certain tested algebra skills.
School leaders or leadership teams should clearly have the authority to make the case that a teacher is ineffective and that the teacher even if tenured should be dismissed on that basis. It may also be the case that the evidence would actually include data on student outcomes – growth, etc. The key, in our view, is that the leaders making the decision – indicated by their presentation of the evidence – would show that they have used information reasonably to make an informed management decision. Their reasonable interpretation of relevant information would constitute due process, as would their attempts to guide the teacher’s improvement on measures over which the teacher actually had control.
By contrast, due process is violated where administrators/decision makers place blind faith in the quantitative measures, assuming them to be causal and valid (attributable to the teacher) and applying arbitrary and capricious cutoff-points to those measures (performance categories leading to dismissal).   The problem, as we see it, is that some of these new state statutes require these due process violations, even where the informed, thoughtful professional understands full well that she is being forced to make a wrong decision. They require the use of arbitrary and capricious cutoff-scores. They require that decision makers take action based on these measures even against their own informed professional judgment.
See also:
The Toxic Trifecta: Bad Measurement & Evolving Teacher Evaluation Policies
Thoughts on Data, Assessment & Informed Decision Making in Schools
And here's the Press Release from the MET folks:

Measures of Effective Teaching Project Releases Final Research Report
 Findings Help Inform Design and Implementation of High-Quality Feedback
and Evaluation Systems

The Measures of Effective Teaching (MET) project, a three-year study designed to determine how to best identify and promote great teaching, today released its third and final research report. The project has demonstrated that it is possible to identify great teaching by combining three types of measures: classroom observations, student surveys, and student achievement gains. The findings will be useful to school districts working to implement new development and evaluation systems for teachers. Such systems should not only identify great teaching, but also provide the feedback teachers need to improve their practice and serve as the basis for more targeted professional development. The MET project, which was funded by the Bill & Melinda Gates Foundation, is a collaboration between dozens of independent research teams and nearly 3,000 teacher volunteers from seven U.S. public school districts.

“Teaching is complex, and great practice takes time, passion, high-quality materials, and tailored feedback designed to help each teacher continuously grow and improve,” said Vicki Phillips, Director of Education, College Ready – U.S. Program at the Bill & Melinda Gates Foundation. “Teachers have always wanted better feedback, and the MET project has highlighted tools like student surveys and observations that can allow teachers to take control of their own development. The combination of those measures and student growth data creates actionable information that teachers can trust.”

The final report from the MET project sought to answer important questions from practitioners and policy-makers about how to identify and foster great teaching. Key findings from the report include:
  • It is possible to develop reliable measures that identify great teaching. In the first year of the study, teaching practice was measured using a combination of student surveys, classroom observations, and student achievement gains. Then, in the second year, teachers were randomly assigned to different classrooms of students. The students’ outcomes were later measured using state tests and supplemental assessments designed to measure students’ conceptual understanding in math and ability to write short answer responses following reading passages. The teachers whose students did better during the first year of the project also had students who performed better following random assignment. Moreover, the magnitude of the achievement gains they generated aligned with the predictions. This is the first large-scale study to demonstrate, using random assignment, that it is possible to identify great teaching.
  • The report describes the trade-offs involved when school systems combine different measures (student achievement gains, classroom observations, and student surveys). However, the report shows that a more balanced approach – which incorporates the student survey data and classroom observations – has two important advantages: ratings are less likely to fluctuate from year to year, and the combination is more likely to identify teachers with better outcomes on assessments other than the state tests.
  • The report provides guidance on the best ways to achieve reliable classroom observations. Many school districts currently require observations by a single school administrator. The report recommends averaging observations from more than one observer, such as another administrator in a school or a peer observer.

“If we want students to learn more, teachers must become students of their own teaching. They need to see their own teaching in a new light. Public school systems across the country have been re-thinking how they describe instructional excellence and let teachers know when they’ve achieved it,” said Tom Kane, Professor of Education and Economics at Harvard’s Graduate School of Education and leader of the MET project. “This is not about accountability. It’s about providing the feedback every professional needs to strive towards excellence.”

The Bill & Melinda Gates Foundation has developed a set of guiding principles, also released today, that states and districts may consider when building and implementing improvement-focused evaluation systems. These principles are based on both the MET project findings and the experiences of the foundation’s partner districts over the past four years.

The MET project has been dedicated to providing its findings to the field in real time. The project's first preliminary findings, released in December 2010, showed that surveying students about their perceptions of their classroom environment provides important information about teaching effectiveness as well as concrete feedback that can help teachers improve. The second set of preliminary findings, released in January 2012, examined classroom observations and offered key considerations for creating high-quality classroom observation systems.

“Great teaching is the most important in-school factor in determining student achievement. It is critical that we provide our teachers with the feedback and coaching they need to master this very challenging profession and become great teachers,” said Tom Boasberg, Superintendent, Denver Public Schools. “We all need to look at multiple sources of information to understand better our teachers’ strengths and development areas so we can provide the most targeted and useful coaching. The MET project’s findings offer new insights that are of immediate use in our classrooms and form a roadmap that districts can follow today.” 


Richard Innes said...


Just to keep this open and transparent, have you ever researched what The Great Lakes Center for Education Research and Practice actually is? Which organizations created and fund it? You can find that if you dig a bit in their web site. Hint: Try “Contact Us” then “About Us.”

You might also be interested in who provides significant funding for the National Education Policy Center (NEPC), as well. You can find that if you dig deep enough into the NEPC web site. Hint: Try “About” and then “Donate to NEPC” (Who’d think you’d find funding data there??).

Basically, I think Arthur Levine got it about right in his reports on Educating School Teachers and Educating Researchers. You can easily Google his interesting reports using a search term like "Arthur Levine Educating School Teachers," etc.

By the way, economists sometimes do fairly good work on education topics, but too often they don’t spend the time needed to know the data and what it really can support.

My favorite example is the large pack of economists, both liberal and conservative, who simplistically look only at “All Student” scores from the NAEP and then pronounce judgment from on high about what is happening in education. Very often, these simplistic evaluations don’t even honor the fact that NAEP is a sampled assessment, with plus-or-minus measurement error in the scores that very often turns what look like clear wins into only ties. Even economists with notable reputations mess up their education analyses, so I am not too impressed when someone pushes a report because the economist who created it supposedly has a great reputation.
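The point about sampling error can be made concrete with hypothetical numbers (these are illustrative, not actual NAEP results): an apparent two-point "win" dissolves once the standard errors of both estimates are combined.

```python
import math

# Illustrative (not real NAEP) numbers: two jurisdiction averages and their
# sampling standard errors.
score_a, se_a = 242.0, 1.1
score_b, se_b = 240.0, 1.2

diff = score_a - score_b
se_diff = math.sqrt(se_a**2 + se_b**2)  # standard error of the difference
z = diff / se_diff
significant = abs(z) > 1.96             # 95% confidence threshold

print(f"difference = {diff:.1f} points, z = {z:.2f}, significant: {significant}")
```

With these assumed standard errors, the two-point gap yields a z-statistic well under 1.96 — statistically, a tie, not a win.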

Oh, yeah, EdWeek’s Quality Counts NAEP analysis makes those mistakes, too. Why don’t you ask Skip Kifer to comment on that?

Anonymous said...

Same old story, folks with money choosing the tune to which we have to dance.

Anonymous said...

Just noticed Terry Holliday has come out praising the Gates report.


Anonymous said...

Holliday is no different from the politicians who rubber-stamp him and give him lip service. He will trumpet anything that supports his perspective and agenda. His day is coming soon - all the various initiatives and assessment schemes have been imposed, so let's see those scores skyrocket now. I suspect instead we will either get a press release announcing his acceptance of another position outside of Kentucky or, if he can reel that in, a litany of excuses heaped on bad teachers, indifferent parents, an unsupportive legislature, ineffective colleges of education, and local communities that don't value education (same as he did with Appalachian performance last year).

Richard Day said...

Richard: I don't have a problem with NEPC. They are trying to support a strong system of public schools - as fits my bias.

Richard Innes said...

Be sure you are not confusing support for "adult interests" in the school system with what is really in the best interests of students and a truly strong public school system.

Richard Day said...