Friday, January 25, 2013

Gates MET Study Produces $45 Million Worth of Questionable Methodology and Conclusions

Testing expert Skip Kifer shot me a note recently, after I had posted Walt Gardner's doubts about the quality of some educational research, which I had titled BS in Science.
If you want to see a real mess, go to the reports of MET, the Gates stuff on teacher effectiveness. The good lord only knows what they did statistically and why they did it. They have all of the fancy Greek symbols and statistical blather. The paper I perused fit your critique well; why should I believe what they conclude if they cannot tell me exactly what they have done and for what reasons? Of course, all of the reporting of the study has concluded that Gates has solved forever (meaning one year's worth) all questions associated with teacher effectiveness.
I finally got around to taking a peek. Being a "qualitative guy," I'll defer to Skip and the "quants" on the particulars, but here's some of what I found.

This from The Great Lakes Center for Education Research and Practice:

Gates Foundation report on teacher evaluations seriously flawed, 
leading economist finds 

'Measures of Effective Teaching' report is based on 
flawed research and predetermined conclusions, review shows
A report on teacher evaluations recently released by the Bill and Melinda Gates Foundation has been refuted by one of the nation's leading economists, who found the widely published report to be seriously flawed.

The Gates Foundation last month released the first report of its "Measures of Effective Teaching" (MET) project, which aims to develop a reliable method for evaluating teachers. The report was thoroughly reviewed for the Think Twice think tank review project by University of California at Berkeley economist Jesse Rothstein, former chief economist at the U.S. Department of Labor.

Rothstein, who is also former senior economist for the Council of Economic Advisers, found the Gates Foundation's MET report to be based on flawed research and predetermined conclusions.
The review was produced by the National Education Policy Center (NEPC), housed at the University of Colorado at Boulder School of Education, with funding from the Great Lakes Center for Education Research and Practice.

Rothstein's analysis found the MET report draws conclusions that are not supported by its own facts, with some data in the report pointing "in the opposite direction" from what is indicated in its "poorly-supported conclusions."

Rothstein found several instances of conclusions not supported by data. One striking example: The MET report's data suggest that many teachers whose students have low math scores rank among the best at teaching "deeper" concepts. Yet the MET report draws the conclusion that teachers whose students score highly on standardized math tests "tend to promote deeper conceptual understanding as well."

Rothstein also found that the MET report relies heavily on standardized test scores and student surveys, which are insufficient measurements of teacher effectiveness, as teachers facing high-stakes testing will emphasize skills and topics geared toward raising test scores, while de-emphasizing those that aren't on the test. High-stakes student surveys, meanwhile, can be distorted by mischievous adolescents who may not answer honestly if they know their responses can affect teachers' compensation and careers, while teachers may be compelled to alter their practice to cater to student demands, Rothstein reported.

Then there's this from School Finance 101, by way of the National Education Policy Center:

Gates Still Doesn’t Get It! 

Trapped in a World of Circular Reasoning & Flawed Frameworks

Not much time for a thorough review of the most recent release of the Gates MET project, but here are my first-cut comments on the major problems with the report. The take-home argument of the report seems to be that their proposed teacher evaluation models are sufficiently reliable for prime-time use and that the preferred model should include about 33 to 50% test-score-based statistical modeling of teacher effectiveness coupled with at least two observations of every teacher. They come to this conclusion by analyzing data on 3,000 or so teachers across multiple cities. They arrive at the 33 to 50% figure, coupled with two observations, by playing a tradeoff game. They find – as one might expect – that prior value-added of a teacher is still the best predictor of itself a year later… but that when the weight on observations is increased, the year-to-year correlation for the overall rating increases (well, sort of). They still find relatively low correlations between value-added ratings for teachers on state tests and ratings for the same teachers with the same kids on higher-order tests.
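That weighting trade-off is easy to sanity-check with a toy simulation. Everything below is my own construction with invented noise levels, not MET data: a VAM score and an observation score both track a stable quality component, the observation score is assumed steadier, and the year-to-year correlation of the weighted composite rises as weight shifts off the noisier VAM.

```python
# Toy sketch (invented noise levels, not MET estimates): why shifting weight
# onto steadier measures raises the year-to-year correlation of a composite.
import numpy as np

rng = np.random.default_rng(2)
n = 3000                                        # roughly the MET sample size
q = rng.normal(size=n)                          # stable underlying teacher quality

def composite(w):
    """One year's composite rating: weight w on noisy VAM, 1 - w on observations."""
    vam = q + rng.normal(scale=1.0, size=n)     # assumed noise SD for value-added
    obs = q + rng.normal(scale=0.5, size=n)     # assumed (smaller) noise SD for observations
    return w * vam + (1 - w) * obs

for w in (1.0, 0.5, 0.33):
    r = np.corrcoef(composite(w), composite(w))[0, 1]   # two independent "years"
    print(f"VAM weight {w:.2f}: year-to-year r = {r:.2f}")
```

With these made-up noise levels the correlation climbs from roughly 0.5 at full VAM weight to about 0.8 at one-third weight. Note that this is purely a reliability exercise; it says nothing about whether the steadier composite is measuring the right thing.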
So, what’s wrong with all of this? Here’s my quick run-down:

1. Self-validating Circular Reasoning
I’ve written several previous posts explaining the absurdity of the general framework of this research which assumes that the “true indicator of teacher effectiveness” is the following year value-added score. That is, the validity of all other indicators of teacher effectiveness is measured by their correlation to the following year value added (as well as value-added when estimated to alternative tests – with less emphasis on this). Thus, the researchers find – to no freakin’ surprise – that prior year value added is, among all measures, the best predictor of itself a year later. Wow – that’s a revelation!
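The mechanics of that circularity can be demonstrated with a toy simulation. The setup below is my own invention, not the MET data: suppose VAM scores pick up both general teaching quality and a stable test-specific component, while observation ratings pick up only general quality. The "validation" criterion of next-year VAM then favors VAM automatically, even though both measures track true quality equally well.

```python
# Toy simulation (my construction, not MET data) of the circular validity test:
# validating measures against next-year VAM rigs the contest in VAM's favor.
import numpy as np

rng = np.random.default_rng(0)
n = 3000                                 # roughly the MET sample size
q = rng.normal(size=n)                   # true general teaching quality
s = rng.normal(size=n)                   # stable test-specific component (e.g., test-prep skill)

vam_y1 = q + s + rng.normal(size=n)      # year-1 value-added: quality + test-specific + noise
vam_y2 = q + s + rng.normal(size=n)      # year-2 value-added, used as the "criterion"
obs_y1 = q + rng.normal(size=n)          # year-1 observation: quality + noise only

r_vam = np.corrcoef(vam_y1, vam_y2)[0, 1]
r_obs = np.corrcoef(obs_y1, vam_y2)[0, 1]
print(f"VAM vs next-year VAM: r = {r_vam:.2f}")   # wins by sharing the s component
print(f"Obs vs next-year VAM: r = {r_obs:.2f}")   # tracks quality just as well, scores lower
```

Here VAM "predicts itself" better (about 0.67 in expectation versus about 0.41) solely because the two VAM scores share the test-specific component; that is exactly the self-validating structure at issue.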
As a result, any weighting scheme must include a healthy dose of value-added.  But, because their “strongest” predictor of itself analysis put too much weight on VAM to be politically palatable, they decided to balance the weighting by considering year to year reliability (regardless of validity).
The hypocrisy of their circular validity test is best revealed in this quote from the study:
Teaching is too complex for any single measure of performance to capture it accurately.
But apparently the validity of any/all other measures can be assessed by the correlation with a single measure (VAM itself)!?????
See also:
Evaluating Evaluation Systems
Weak Arguments for Using Weak Indicators

2. Assuming Data Models Used in Practice are of Comparable Quality/Usefulness
I would go so far as to say that it is reckless to assert that the new Gates findings on this relatively select sub-sample of teachers (for whom high quality data were available on all measures over multiple years) have much if any implication for the usefulness of the types of measures and models being implemented across states and districts.

I have discussed the reliability and bias issues in New York City’s relatively rich value-added model on several previous occasions. The NYC model (likely among the “better” VAMs) produces results that are sufficiently noisy from year to year to raise serious questions about their usefulness.

Certainly, one should not be making high stakes decisions based heavily on the results of that model. Further, averaging over multiple years means, in many cases, averaging scores that jump from the 30th to 70th percentile and back again.  In such cases, averaging doesn’t clarify, it masks. But what the averaging may be masking is largely noise. Averaging noise is unlikely to reveal a true signal!
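The averaging-noise point can be illustrated with a deliberately extreme hypothetical (invented numbers, not the NYC data): give every teacher an identical true effect, so each yearly score is pure noise, then rank teachers on their two-year average.

```python
# Hypothetical extreme (not NYC data): identical teachers, pure-noise scores.
# A two-year average still produces a confident-looking percentile ranking.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
year1 = rng.normal(size=n)   # every teacher's true effect is identical,
year2 = rng.normal(size=n)   # so each year's score is nothing but noise

r = np.corrcoef(year1, year2)[0, 1]
avg = (year1 + year2) / 2
pct = avg.argsort().argsort() / (n - 1) * 100   # percentile rank of the two-year average

print(f"year-to-year correlation: {r:.2f}")     # near zero, as built in
print(f"teachers at or above the 80th percentile: {(pct >= 80).sum()}")   # 200 of 1000
```

By construction a fifth of these identical teachers still land in the "top 20%" on their averaged score; averaging spread the noise around rather than canceling it into a signal.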
Further, as I’ve discussed several times on this blog, many states and districts are implementing methods far more limited than a “high quality” VAM, and in some cases states are adopting growth models that don’t attempt – or only marginally attempt – to account for any other factors that may affect student achievement over time. Even when those models do make some attempts to account for differences in students served, in many cases, as in the recent technical report on the model recommended for use in New York State, those models fail! And they fail miserably. But despite the fact that those models fail so miserably at their central, narrowly specified task (parsing teacher influence on test score gain), policymakers continue to push for their use in making high stakes personnel decisions.

The new Gates findings – while not explicitly endorsing use of “bad” models – arguably embolden this arrogant, wrongheaded behavior! The report's authors have a responsibility to be clearer about what constitutes a better, more appropriate model versus what constitutes an entirely inappropriate one.
See also:
Reliability of NYC Value-added
On the stability of being Irreplaceable (NYC data)
Seeking Practical uses of the NYC VAM data
Comments on the NY State Model
If it’s not valid, reliability doesn’t matter so much (SGP & VAM)

3. Continued Preference for the Weighted Components Model
Finally, my biggest issue is that this report and others continue to think about this all wrong. Yes, the information might be useful, but not if forced into a decision matrix or weighting system that requires the data to be used/interpreted with a level of precision or accuracy that simply isn’t there – or worse – where we can’t know if it is. (emphasis added)

Allow me to copy and paste one more time the conclusion section of an article I have coming out in late January:
As we have explained herein, value-added measures have severe limitations when attempting even to answer the narrow question of the extent to which a given teacher influences tested student outcomes. Those limitations are sufficiently severe that it would be foolish to impose rigid, overly precise, high-stakes decision frameworks on these measures. One simply cannot parse point estimates to place teachers into one category versus another, and one cannot assume that any one individual teacher’s estimate is necessarily valid (non-biased). Further, we have explained how student growth percentile measures being adopted by states for use in teacher evaluation are, on their face, invalid for this particular purpose. Overly prescriptive, overly rigid teacher evaluation mandates, in our view, are likely to open the floodgates to new litigation over teacher due process rights, despite much of the policy impetus behind these new systems supposedly being reduction of legal hassles involved in terminating ineffective teachers.
This is not to suggest that any and all forms of student assessment data should be considered moot in thoughtful management decision making by school leaders and leadership teams. Rather, that incorrect, inappropriate use of this information is simply wrong – ethically and legally (a lower standard) wrong. We accept the proposition that assessments of student knowledge and skills can provide useful insights both regarding what students know and potentially regarding what they have learned while attending a particular school or class. We are increasingly skeptical regarding the ability of value-added statistical models to parse any specific teacher’s effect on those outcomes. Further, the relative weight in management decision-making placed on any one measure depends on the quality of that measure and likely fluctuates over time and across settings. That is, in some cases, with some teachers and in some years, assessment data may provide leaders and/or peers with more useful insights.  In other cases, it may be quite obvious to informed professionals that the signal provided by the data is simply wrong – not a valid representation of the teacher’s effectiveness.
Arguably, a more reasonable and efficient use of these quantifiable metrics in human resource management might be to use them as a knowingly noisy pre-screening tool to identify where problems might exist across hundreds of classrooms in a large district. Value-added estimates might serve as a first step toward planning which classrooms to observe more frequently. Under such a model, when observations are completed, one might decide that the initial signal provided by the value-added estimate was simply wrong. One might also find that it produced useful insights regarding a teacher’s (or group of teachers’) effectiveness at helping students develop certain tested algebra skills.
School leaders or leadership teams should clearly have the authority to make the case that a teacher is ineffective and that the teacher, even if tenured, should be dismissed on that basis. It may also be the case that the evidence would actually include data on student outcomes – growth, etc. The key, in our view, is that the leaders making the decision – indicated by their presentation of the evidence – would show that they have used information reasonably to make an informed management decision. Their reasonable interpretation of relevant information would constitute due process, as would their attempts to guide the teacher’s improvement on measures over which the teacher actually had control.
By contrast, due process is violated where administrators/decision makers place blind faith in the quantitative measures, assuming them to be causal and valid (attributable to the teacher) and applying arbitrary and capricious cutoff points to those measures (performance categories leading to dismissal). The problem, as we see it, is that some of these new state statutes require these due process violations, even where the informed, thoughtful professional understands full well that she is being forced to make a wrong decision. They require the use of arbitrary and capricious cutoff scores. They require that decision makers take action based on these measures even against their own informed professional judgment.
See also:
The Toxic Trifecta: Bad Measurement & Evolving Teacher Evaluation Policies
Thoughts on Data, Assessment & Informed Decision Making in Schools
And here's the Press Release from the MET folks:

Measures of Effective Teaching Project Releases Final Research Report
 Findings Help Inform Design and Implementation of High-Quality Feedback
and Evaluation Systems

The Measures of Effective Teaching (MET) project, a three-year study designed to determine how to best identify and promote great teaching, today released its third and final research report. The project has demonstrated that it is possible to identify great teaching by combining three types of measures: classroom observations, student surveys, and student achievement gains. The findings will be useful to school districts working to implement new development and evaluation systems for teachers. Such systems should not only identify great teaching, but also provide the feedback teachers need to improve their practice and serve as the basis for more targeted professional development. The MET project, which was funded by the Bill & Melinda Gates Foundation, is a collaboration between dozens of independent research teams and nearly 3,000 teacher volunteers from seven U.S. public school districts.

“Teaching is complex, and great practice takes time, passion, high-quality materials, and tailored feedback designed to help each teacher continuously grow and improve,” said Vicki Phillips, Director of Education, College Ready – U.S. Program at the Bill & Melinda Gates Foundation. “Teachers have always wanted better feedback, and the MET project has highlighted tools like student surveys and observations that can allow teachers to take control of their own development. The combination of those measures and student growth data creates actionable information that teachers can trust.”

The final report from the MET project sought to answer important questions from practitioners and policy-makers about how to identify and foster great teaching. Key findings from the report include:
  • It is possible to develop reliable measures that identify great teaching. In the first year of the study, teaching practice was measured using a combination of student surveys, classroom observations, and student achievement gains. Then, in the second year, teachers were randomly assigned to different classrooms of students. The students’ outcomes were later measured using state tests and supplemental assessments designed to measure students’ conceptual understanding in math and ability to write short answer responses following reading passages. The teachers whose students did better during the first year of the project also had students who performed better following random assignment. Moreover, the magnitude of the achievement gains they generated aligned with the predictions. This is the first large-scale study to demonstrate, using random assignment, that it is possible to identify great teaching.
  • The report describes the trade-offs involved when school systems combine different measures (student achievement gains, classroom observations, and student surveys). However, the report shows that a more balanced approach – which incorporates the student survey data and classroom observations – has two important advantages: ratings are less likely to fluctuate from year to year, and the combination is more likely to identify teachers with better outcomes on assessments other than the state tests.
  • The report provides guidance on the best ways to achieve reliable classroom observations. Many school districts currently require observations by a single school administrator. The report recommends averaging observations from more than one observer, such as another administrator in a school or a peer observer.

“If we want students to learn more, teachers must become students of their own teaching. They need to see their own teaching in a new light. Public school systems across the country have been re-thinking how they describe instructional excellence and let teachers know when they’ve achieved it,” said Tom Kane, Professor of Education and Economics at Harvard’s Graduate School of Education and leader of the MET project. “This is not about accountability. It’s about providing the feedback every professional needs to strive towards excellence.”

The Bill & Melinda Gates Foundation has developed a set of guiding principles, also released today, that states and districts may consider when building and implementing improvement-focused evaluation systems. These principles are based on both the MET project findings and the experiences of the foundation’s partner districts over the past four years.

The MET project has been dedicated to providing its findings to the field in real time. The project's first preliminary findings, released in December 2010, showed that surveying students about their perceptions of their classroom environment provides important information about teaching effectiveness as well as concrete feedback that can help teachers improve. The second set of preliminary findings, released in January 2012, examined classroom observations and offered key considerations for creating high-quality classroom observation systems.

“Great teaching is the most important in-school factor in determining student achievement. It is critical that we provide our teachers with the feedback and coaching they need to master this very challenging profession and become great teachers,” said Tom Boasberg, Superintendent, Denver Public Schools. “We all need to look at multiple sources of information to understand better our teachers’ strengths and development areas so we can provide the most targeted and useful coaching. The MET project’s findings offer new insights that are of immediate use in our classrooms and form a roadmap that districts can follow today.” 

Texas Approves Charters for Affluent Anglos

Long-time KSN&C readers may recall that I support (albeit weakly) a Kentucky charter school law that permits charters in a narrow set of circumstances - where long-term failure has attended certain public schools and the local school boards have not figured out a way to deliver equal educational opportunity for poor kids. That position forces me to oppose efforts by charter promoters in more cases than not. Efforts to allow charter schools in Kentucky have always left the door open to what I consider to be abuses of the system at the expense of the children at large - such as a charter set up to compete with already successful schools, or ones that could open the door to religious segregation. The story below is Exhibit A. It's what I believe many of Kentucky's charter promoters really want - even though they spend all of their time talking about the poor minority children they hope to protect (while railing against providing medical coverage for the same folks). Diane Ravitch asked the right question about charters: Do Affluent White Neighborhoods Need Charter Schools?

This from Behind FrenemyLines:

Wall Street gang now offer education choices

About a quarter of the kids in the San Antonio Independent School District attend charter schools. Most are the low-income, minority students we think about when we imagine providing innovative opportunities for kids stuck in failing public schools in bad neighborhoods. For a long time, school reform has targeted only kids from poor families. You know, the lucky ones who get those free lunches.

Starting this fall, though, no longer will Texas exclude upper-middle class white kids like mine from the gravy train of school choice. Last November, the State Board of Education approved a charter allowing Great Hearts Academies to open a school in North San Antonio, the wealthier, whiter section of a majority-Hispanic city.

Great Hearts Academies operates out of Arizona, where they survive not just on public funding that would normally go to public schools but also on mandatory fees as well as contributions from students’ families, pricing Great Hearts out of reach for most San Antonio families. In other words, upper-middle class Anglos are finally getting a taxpayer-subsidized private school. Our long nightmare of being stuck in high-performing, better-funded public schools is almost over.

If that’s not what you have in mind when you think of school choice, you’re not alone. Great Hearts tried this in Nashville, but the school board rejected the charter application, arguing reasonably that creating a government-funded private school to serve an affluent, white neighborhood constituted segregation. It’s exactly what they’re planning in North San Antonio, except our school board approved it.

Private tuition and public subsidies only provide enough money to pay the teachers, buy textbooks and keep the lights on. To build schools, you need to go into massive debt. But don’t worry, because our need to borrow millions of dollars creates an investment opportunity for Wall Street investment bankers. Apparently charter schools are “a favorite cause of many of the wealthy founders of New York hedge funds.” The word you’re probably looking for is “yippee.”

Public school bonds are a safe investment, but low risk means lower reward, in this case an average 3% return on the general-obligation bonds used to raise money to build schools. But debt for charter schools runs an average of 3.8% higher than general-obligation bonds, and charter schools even qualify for federal tax credits under the Community Renewal Tax Relief Act of 2000.

As every investment prospectus says in small type, investments carry risk. In this case, 3.91% of charter-school bonds are in default versus 0.03% for public schools. And since 1992, 15% of charters have closed, including 52 in Texas.

Despite the risks, charter schools are big business. JPMorgan Chase of worldwide economic meltdown fame is bullish on charter school construction.

“Many charter schools have expanded access to academic opportunities for students in all types of communities, so we shouldn’t let tough economic times bring them down,” said JPMorgan Chase Chairman and CEO Jamie Dimon.

This is the same Jamie Dimon who thought mortgage-backed securities were foolproof, who was forced to take $25 billion of our money in the bank bailout, who wrongly foreclosed on military families, who overcharged 4,000 other military families by $2 million, and who then lost $2 billion of our money in what amounted to the kind of gambling that only happens after 4 am in Las Vegas. Let’s absolutely have this guy underwrite our schools. What could go wrong that hasn’t already many times over?

Subjecting our public school system to the free market requires us to accept that hopped-up Wall Street bankers will mess up, schools will close, and sooner or later, someone will have to choose between increasing shareholder returns and improving some kid’s education. Failure is not only an option. When it comes to Wall Street, failure is inevitable.

The specter of resegregating our schools along racial and economic lines under the cloak of school choice presents a more daunting future for a state that is growing poorer, browner, and younger. When it comes to schools, the question isn’t whether we’re going to have charter schools or public schools. We have both now. When it comes to schools, the real choice is whether we are all in this together or if it’s every man for himself.

Congressional Inaction Still Holding Up Education Laws

This from Education Week:

Crush of Education Laws Awaits Renewal in Congress
The new, still-divided Congress that took office this month faces a lengthy list of education policy legislation that is either overdue for renewal or will be soon, in a political landscape that remains consumed with fiscal issues.

But it's tough to say whether there will be much action on all that outdated legislation—including the No Child Left Behind Act, which has awaited reauthorization since 2007. The cast of characters in Washington is virtually unchanged since before the 2012 elections—which left President Barack Obama in the White House, Democrats in control of the Senate, and Republicans in control of the House of Representatives.

So far, that's led to a serious legislative logjam on everything from a limited bill renewing education research programs to the budget of the entire U.S. government.

And one education bill—the reauthorization of the Child Care and Development Block Grant program, which governs some key early-child-care grants—hasn't gotten a makeover in more than a decade and a half. It was last reauthorized in 1996, when President Bill Clinton was running for his second term.

"This is unprecedented," said Jack Jennings, who served as an aide for Democrats on the House Education committee from 1967 to 1994. Mr. Jennings said that when he worked on Capitol Hill nearly two decades ago, lawmakers kept to a schedule, tackling big reauthorizations, such as for the Elementary and Secondary Education Act or the Higher Education Act, roughly every two years. That "discipline" is gone, he said.

He blames both parties in Congress for the lack of movement. Republicans, he said, have been adamant about a limited federal role in education, making compromise difficult. And Democrats could have passed a renewal of the ESEA law—and other key education legislation—in 2009 and 2010 when they controlled both houses of Congress and the White House, he added.

"Congress should be ashamed of itself," said Mr. Jennings, who founded the Center on Education Policy, a research and advocacy organization, after leaving the Hill and is now retired. "The same people who neglect their legislative duties bewail the sad state of American education."

The appropriations committees—which control funding—typically continue to finance education programs, even after their authorizations have long expired, noted Mr. Jennings. The problem, he said, is that programs can become outdated, then viewed as ineffective, and finally, slashed, as Congress looks to trim spending. Without authorizing legislation, he said, "there is the potential for major disruption."

Vic Klatt, who worked for years as a top aide to Republicans on the House education committee, said he couldn't remember a time when the congressional education to-do list had been this lengthy. Lawmakers' inaction means that the Obama administration "has been able to get away with doing whatever they want, whenever they want," said Mr. Klatt, who is now a principal at Penn Hill Group, a government-relations firm in Washington.

Searching for Stability

An obvious case in point: Because of congressional inaction on the ESEA, the administration has issued waivers from the No Child Left Behind Act, the law's current iteration, allowing states to get out from under key mandates in exchange for embracing the administration's education redesign priorities.

But the waivers don't provide enough predictability for states, most of which are in the midst of moving toward new academic standards, assessments, and teacher-evaluation systems, said Peter Zamora, the director of federal relations for the Council of Chief State School Officers. He also noted that another federal debt-ceiling fight is in the offing, and automatic spending cuts of 8 percent for key federal programs loom.

"It becomes particularly challenging for practitioners to plan, if you don't know what NCLB will look like, or what your federal funding will look like," he said. "We're urging Congress to reassert itself and provide some stability."

It's unclear just how much the administration will push for an ESEA renewal in the coming Congress. U.S. Secretary of Education Arne Duncan told the CCSSO in a November speech in Savannah, Ga., that the administration needs to see interest from Congress before it leads on ESEA renewal.

The House and Senate education committees approved bills in the previous Congress to update the law. While both measures would have given considerably more flexibility to states in creating accountability systems, the two chambers clashed on critical issues such as the role of the federal government in school improvement, teacher effectiveness, and the scope of the Education Department.

Still, Sen. Tom Harkin, D-Iowa, the chairman of the Senate Health, Education, Labor, and Pensions Committee, and Rep. John Kline, R-Minn., the chairman of the House Education and the Workforce Committee, have each listed ESEA renewal as a top priority for the new year.

And both lawmakers say they're optimistic that the new Congress will be a productive one for education.

"Traditionally, this has been an area where we can find bipartisan agreement, and I hope that will continue in the new Congress," said Sen. Harkin in an email.

An aide to Sen. Harkin said lawmakers' current workload is typical of recent Congresses, in which reauthorizations of major laws often go through several iterations before they are finally passed. The aide cited the ESEA bill, approved by the panel in fall 2011, as an important step forward in renewing the law, but wasn't ready to say which portions of that legislation would serve as the basis for a new measure this year.

The Senate committee will begin the renewal process by taking a closer look at waiver implementation, likely through a hearing, a Harkin aide said.

And Mr. Kline appears ready to roll up his sleeves. "Clearly, we have our work cut out for us in the 113th Congress," he said in an email. "I hope my committee colleagues and I can work together in a bipartisan fashion ... on these important issues."

Get in Line

Meanwhile, renewals of more narrowly tailored laws—such as the Individuals with Disabilities Education Act—are likely to languish behind the ESEA, since the main K-12 law sets the stage for other policy negotiations.

"ESEA holds everything up," said Lindsay Jones, the senior director of policy and advocacy services at the Council for Exceptional Children, a Washington-based organization that focuses on students in special and gifted education.

A delay in updating the IDEA, which was last renewed in 2004, leaves unsettled policy questions that are likely to be dealt with in the overhauls of both laws, such as how districts should approach assessments for students in special education.

And an unsettled ESEA reauthorization is also bogging down the renewal of the Education Sciences Reform Act, or ESRA, which governs the Institute of Education Sciences and is seen as a companion bill to the ESEA.

States and districts are "really moving the ball forward in areas where there's not a lot of research," such as school turnarounds and assessments, said Mr. Zamora. "We would like to see ESRA better aligned to state practice."

If the logjam breaks, it could offer an opportunity to harmonize different, but related, pieces of legislation, said Lisa Guernsey, the director of the Early Education Initiative at the New America Foundation, a think tank in Washington.

For instance, she suggested that Congress may want to consider renewing related legislation in batches, dealing all at once with the three measures that touch on early-childhood education—Head Start, Child Care and Development Block Grant Act, and ESEA.

'Unprecedented' Logjam

Longtime Capitol Hill aides from both sides of the political aisle can't remember a time when Congress was this jammed up.

The 113th Congress, which took office this month, has a long to-do list when it comes to education legislation. Among pending renewals:
Carl D. Perkins Career and Technical Education Act: Governs vocational education programs and is the largest federal program for high schools. Last renewed in 2006.
Child Care and Development Block Grant Act: Governs major child-care grants. Last renewed in 1996.
Education Sciences Reform Act: Governs the Institute of Education Sciences. Last renewed in 2002.
Elementary and Secondary Education Act: Governs Title I and other key K-12 education programs. Most recent iteration is the No Child Left Behind Act. Last renewed in 2002.
Head Start Act: Governs a nearly $8 billion program that offers early-childhood education services to low-income families. Last renewed in 2007.
Higher Education Act: Governs teacher education programs, as well as student financial aid and college-access programs, including GEAR-UP and TRIO. Last renewed in 2008.
Individuals with Disabilities Education Act: Governs special education programs. Last renewed in 2004.
Workforce Investment Act: Governs job training programs. Last renewed in 1998.

Saturday, January 19, 2013

What Do International Tests Really Show About U. S. Student Performance?

Because social class inequality is greater in the United States
than in any of the countries with which we can reasonably be compared,
the relative performance of U.S. adolescents is better than it appears
when countries’ national average performance is conventionally compared.

--From a report, released by the Economic Policy Institute
and co-authored by Martin Carnoy and Richard Rothstein. 

 This from the Daily Kos:

I was allowed to see an embargoed version of the report ahead of the official release in order to draft this post about it -  I will be teaching when the embargo is lifted, and this is scheduled to go live automatically.

I asked to see the report both because of the importance of the topic and the credibility of the authors.
Let me start with the authors. Martin Carnoy is a labor economist with a special interest in the relationship between the economy and the educational system.  This is especially relevant given how the concerns we so often hear about our educational system are usually phrased in terms of the economic risk to the nation.  Carnoy, besides his work at EPI, holds an endowed chair in the Stanford Graduate School of Education, where he chairs the International and Comparative Education program.  In short, he is one of America's best experts on international comparisons.

Richard Rothstein is a senior fellow of the Chief Justice Earl Warren Institute on Law and Social Policy at the University of California (Berkeley) School of Law.  He is the author of numerous books on education.  He has served as the principal education writer for the New York Times. His research and writing have often focused on how economic inequality affects education.

Clearly, anything these two gentlemen have to offer about education is worth paying attention to...

There are two major sets of international tests.  TIMSS, which now stands for Trends in International Mathematics and Science Study, is given to 4th and 8th grade students on a 4-year cycle that began in 1995, and it is intended to provide trends in mathematics and science achievement.  It is now given in over 60 nations, although not all nations have participated in all of the administrations.

The people who produce TIMSS also produce PIRLS (Progress in International Reading Literacy Study), which has been given on a 5-year cycle.  Both TIMSS and PIRLS were given in 2011.  PIRLS is NOT part of the analysis of Carnoy and Rothstein.

PISA, which stands for Programme for International Student Assessment, is offered by the OECD, the Organization for Economic Cooperation and Development.  First administered in 2000, PISA is given to 15-year-olds on a 3-year cycle, assessing their competencies in reading, mathematics, and science.

In each case, the tests are given to a sample of students in a participating nation.

We regularly hear people bemoaning the ranking of the US on such exams.  In the past, various people have pointed out a number of problems with placing too much weight on such rankings.  For example, in some past administrations of TIMSS, some nations exempted non-native speakers of the primary languages. Others eliminated questions on coastal biology on the grounds that their nation lacked a coastline (something the US did not do for students in the middle of our vast nation).

Often in recent years we have been held up and compared to the countries that did best, most notably Finland, more recently also  South Korea, Singapore, Shanghai, and Canada.  Singapore is largely a city-state, and Shanghai is very much unlike the rest of China.

Carnoy and Rothstein built their analysis around matching US performance against that of Finland, South Korea, and Canada as high-performing countries, and against three large post-industrial countries with a number of economic similarities to the US: the UK, France, and Germany.

To provide further depth to their analysis of US performance, the authors also examined data from various versions of the National Assessment of Educational Progress (NAEP), often described by educational researchers as the nation's report card.  Like TIMSS and PISA, it uses a random sample of students from across the nation.

The last piece of background information I want to provide is this:  the data sets for the tests are eventually made available to researchers, which allows for the kind of analysis done by Carnoy and Rothstein.  Unfortunately most news organizations and politicians never get beyond the ordinal rankings - where the US ranks compared to other nations.  One danger in focusing on rankings is that there may be little meaningful difference between, say, being in 3rd place and being in 11th place.  Think of it like this: what is the difference between a .287 hitter and a .278 hitter?
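A quick sketch makes the point about rankings concrete. The national averages below are invented for illustration (they are not from the report): countries whose scores differ by less than a test's margin of error can still sit many places apart in an ordinal ranking.

```python
# Hypothetical national average scores (invented numbers, for illustration only).
scores = {
    "A": 529, "B": 527, "C": 526, "D": 525, "E": 524,
    "F": 523, "G": 522, "H": 521, "I": 520, "J": 519, "K": 497,
}

# Rank countries by average score, best first.
ranked = sorted(scores, key=scores.get, reverse=True)
rank_of = {country: i + 1 for i, country in enumerate(ranked)}

# Country J sits in 10th place yet trails 3rd-place C by only 7 points,
# a gap that can easily be smaller than the test's standard error.
print(rank_of["C"], rank_of["J"], scores["C"] - scores["J"])  # 3 10 7
```

The headline "10th place" and the headline "3rd place" describe nearly identical performances; only country K is meaningfully behind the pack.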

Historian Diane Ravitch likes to remind us that the US has always done poorly on such international comparisons.  Others - including me - will point out that the US has a significantly higher proportion of children in poverty (in our case measured by Free and Reduced Lunch statistics) than just about any other country in the comparisons (Mexico has more than us) - for example, Finland has a childhood poverty rate of less than 5% while we have one of more than 20%.  Those making such comparisons like to point out that our schools with less than 10% of students in poverty tend to perform about as well as high-scoring nations like Finland, even though that 10% is still twice Finland's rate.

This actually points to something Richard Rothstein reminded me of - that the issue is not merely the percentage of students in poverty.  Students with low incomes may not necessarily be culturally deprived, which is why other indicators are also important.  If one is looking at the cultural deficits with which a child arrives at school, one might want to look at the educational attainment of the mothers and/or the number of books in the home.  If I might, imagine a child of two parents, at least one of whom is a graduate student.  The income of that family might be quite low, yet the mother will probably be at least a college graduate and the family is likely to have significant educational assets (such as books) in their home.

Of greater importance, it is likely that child will be attending a school not heavily populated by children from poor and/or educationally deprived backgrounds.  There is a large body of evidence that says poor or culturally deprived children who attend schools where they are in the minority tend to outperform most students in schools where poor or culturally deprived students are a distinct, even heavy, majority.

So let me offer some of the key takeaways from the report.  If you do not want to wade through all the tables and detailed discussions (something educational policy wonks like me do like to do), you can get a very good overview from the executive summary, which runs for about 5 of the report's 100 pages (the 41 very detailed end notes begin on p. 92 and run to p. 97, where the references begin, continuing to p. 99).

The authors conclude, after thoroughly examining all the data, that
In general, we find that test data are too complex and oversimplified to permit meaningful policy conclusions regarding U. S. educational performance without deeper study of test results and methodology. (p.2)
 Of course, that does not stop policy makers and those who have an agenda of demeaning U. S. public schools.

I am going to quote the two sets of findings the authors place in bold italics, a formatting I will also follow.  I will then offer some of the detail they provide to support these findings.

Because social class inequality is greater in the United States than in any of the countries with which we can reasonably be compared, the relative performance of U. S. adolescents is better than it appears when countries' national average performance is conventionally compared. (p.2)
To start with, we have more children in poverty.  Exacerbating the results is that a sampling error in the most recent version of PISA
resulted in the most-disadvantaged U. S. students being over-represented in the overall U. S. test-taker sample.  This error further depressed the reported average U. S. test score. (p. 2)
To this let me add the following:  we have a greater proportion of our young people from families in the bottom of the social class distribution than the six nations to which they compare us.  Carnoy and Rothstein define social class by characteristics other than income, for the reasons noted previously.

However, one needs to also consider this:
At all points in the social class distribution, U. S. students perform worse, and in many cases, substantially worse, than students in a group of top-scoring countries (Canada, Finland, and Korea).  Although controlling for social class distribution would narrow the difference in average scores between these countries and the United States, it would not eliminate it. (p. 3)
Because not only educational effectiveness but also countries' social class composition changes over time, comparisons of test score trends over time by social class group provide more useful information to policymakers than comparisons of total average test scores at one point in time or even changes in total average test scores over time.  (p. 3)

The authors then note that the performance of our lowest social class students has been improving while that of similar students in both groups of compared countries, top-scoring and similarly post-industrial, has been falling.  As for our middle-class and advantaged students, their scores have not been improving, but in some countries in both groups the scores of comparable students have declined.

Methodological issues also affect the results, flowing both from how samples were obtained for the international tests and from how the assessment items compare with what is actually being taught.

 For example, the share of 2009 PISA test-takers drawn from schools with high percentages of students eligible for Free and Reduced Lunch assistance may well have distorted the results.  There is a detailed discussion of this in the material around Table 25, which appears on p. 70.  The authors note that while the percentage of program-eligible students who took the test was close to their share of the American school population - 35% compared to 36% - the distribution was skewed: 40% of the students taking that version of PISA came from high-poverty schools (50% or more of students eligible), while only 23% of US students attend such schools.  If one looks at schools with 75% eligibility, 16% of the sample came from such schools while only 6% of US students attend such schools.  Since we know that students of similar economic background perform better in schools with a lower percentage of students from distressed economic circumstances, the authors argue that there is at least a strong suggestion that the overall score for the U. S. is distorted.
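The arithmetic behind that distortion is simple to sketch. The shares below are the ones reported in the text (40% of the sample versus 23% of actual enrollment from high-poverty schools); the mean score for each school type is invented for illustration, since the report's figures are not quoted here. Reweighting the sample to the true enrollment distribution raises the estimated national average:

```python
# Shares from the discussion of the 2009 PISA sample: test-takers vs. actual
# U.S. enrollment, by school poverty level.
sample_share = {"high_poverty": 0.40, "other": 0.60}
actual_share = {"high_poverty": 0.23, "other": 0.77}

# Mean scores by school type are invented for illustration only.
mean_score = {"high_poverty": 450, "other": 510}

def weighted_avg(shares, scores):
    """Average score implied by a given mix of school types."""
    return sum(shares[k] * scores[k] for k in shares)

reported = weighted_avg(sample_share, mean_score)  # what the skewed sample yields
adjusted = weighted_avg(actual_share, mean_score)  # reweighted to true enrollment

print(round(reported, 1), round(adjusted, 1))  # 486.0 496.2
```

With these invented scores, over-sampling high-poverty schools shaves roughly ten points off the national average — the direction of bias the authors describe, whatever its true magnitude.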

Second, the distribution of students in poverty is not necessarily the same for those being tested on math and those being tested on reading on the same application of the test.   The authors point to a specific inconsistency for Finland.

Third, the population samples are not stable from application to application.  This MAY be an artifact of the changing of the school populations during that time, but also may represent sampling error which can explain some of the variation in scores over time.

Fourth, the content that is sampled for Mathematics varies from test to test, which may account for some of the variance in results between different applications of the same test.

Fifth, the math being sampled may or may not match the curricular aims of a particular nation.   A greater emphasis on questioning students on geometry would advantage a nation which emphasizes geometry by the age the students are tested, while disadvantaging a nation which places greater emphasis on things like algebra and statistics. US math curricula tend to emphasize the latter, while the tests have increasingly sampled geometry.  Changes in the percentages of the various mathematical domains being sampled can affect a nation's performance over time without accurately representing changes in the students' actual mathematical performance.

Sixth, shifting math items toward a greater emphasis on problem-solving, as PISA has done, makes the math assessments effectively a reading comprehension test as well as a mathematics test.  Given the impact that parental literacy has on reading comprehension, the results may be more accurate for mathematical performance for higher- than for lower-class students.  It also may advantage ON MATH those nations with more effective literacy programs.

Seventh, testing at a particular age may represent different years of educational background.  Nations begin formal schooling at different ages, and this is further complicated by the availability of early childhood education and what that entails.  Here it is worth noting that the U. S. continues to lag in providing universal access to early childhood education, which, along with a higher degree of poverty and a larger share of the population with indicators such as a lower number of books in the home and/or a higher percentage of mothers with low educational attainment, may also impact the results we are seeing.
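The fifth point, about shifting domain emphasis, can also be sketched with a bit of arithmetic. Using invented domain averages for a country that is stronger in statistics than in geometry, tilting the test's weights toward geometry lowers the composite score even though nothing about the students' skills has changed:

```python
# Hypothetical domain averages for one country (invented for illustration).
domain_score = {"algebra": 520, "statistics": 525, "geometry": 480}

def composite(weights):
    # Weighted average across domains; the weights sum to 1.
    return sum(weights[d] * domain_score[d] for d in domain_score)

old_weights = {"algebra": 0.40, "statistics": 0.35, "geometry": 0.25}
new_weights = {"algebra": 0.30, "statistics": 0.25, "geometry": 0.45}  # heavier geometry

print(round(composite(old_weights), 2), round(composite(new_weights), 2))
```

The composite drops by about eight points purely because of the reweighting - a change in the yardstick, not in what is being measured.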

There is lots more.

The authors explain the methodologies they use to analyze the data.   They raise cautions about their own analysis where appropriate.

Secretary of Education Duncan issued a now-famous statement after the release of the 2011 TIMSS scores, calling the results "unacceptable."  He made his argument by combining those results with those of the 2009 PISA, whose results he argued justified the policies he had been pursuing, which I remind readers included Race to the Top and the Blueprint for Education, and now include the requirements he imposed for states to receive waivers from No Child Left Behind.  Carnoy and Rothstein, in one of the most important paragraphs of this document, write
This conclusion, however, is oversimplified, exaggerated, and misleading.  It ignores the complexity of the content of test results and may well be leading policymakers to pursue inappropriate and even harmful reforms that change aspects of the U. S. education system that may be working well and neglect aspects that may be working poorly.  (p. 6)
 For example, low scores on a test that emphasizes geometry over statistics and probability might lead to changing the emphasis, but as the authors note
It certainly might not be good public policy to reduce curricular emphasis on statistics and probability, skills essential to an educated citizenry in a democracy, in order to make more time available for geometry. (p.8)
By now you should have a sense of the thoroughness of this report.  It may not be the kind of reading the average person would select for her bedtime (unless one expected to become drowsy while wading through the detailed tables and explanations).  It is nevertheless something that should be essential reading for those who want to draw APPROPRIATE conclusions from the international comparisons.

As a professional educator whose life work is severely impacted by the decisions of policy makers, and as one who engages in a discourse over education policy that is often shaped by journalists, I fervently hope that both policy makers and journalists will take the time to educate themselves about the real information in such tests, so that we do not continue to pursue policies based on false understandings - policies likely to be counterproductive to real learning without necessarily achieving the illusory goal of higher relative performance on international comparisons.

As the authors note on page 5
To make judgments only on the basis of statistically significant differences in national average scores, on only one test, at only one point in time, without regard to social class context or curricular or population sampling methodologies, is the worst possible choice.  But unfortunately, this is how most policymakers and analysts approach the field.
The field is educational policy.

It shapes the future of our nation.

It involves hundreds of billions of taxpayer dollars.

It shapes the development of our most precious resource, our young people.

Might it not make sense to understand what the data really mean before we commit major resources to attempting to change them?

One would think.

I for one am exceedingly grateful for the magnificent effort Carnoy and Rothstein have made to help us with that understanding.

If you truly want to understand what the data from PISA and TIMSS mean, this report is essential reading.

If you want a broad overview but do not want to drill down, at least read the Executive Summary.