Saturday, January 19, 2013

What Do International Tests Really Show About U. S. Student Performance?

Because social class inequality is greater in the United States
than in any of the countries with which we can reasonably be compared,
the relative performance of U.S. adolescents is better than it appears
when countries’ national average performance is conventionally compared.

--From a report, released by the Economic Policy Institute
and co-authored by Martin Carnoy and Richard Rothstein. 

 This from The Daily Kos:

I was allowed to see an embargoed version of the report ahead of the official release in order to draft this post about it -  I will be teaching when the embargo is lifted, and this is scheduled to go live automatically.

I asked to see the report both because of the importance of the topic and the credibility of the authors.
Let me start with the authors. Martin Carnoy is a labor economist with a special interest in the relationship between the economy and the educational system.  This is especially relevant given how the concerns we so often hear about our educational system are usually phrased in terms of economic risk to the nation.  Carnoy, besides his work at EPI, holds an endowed chair in the Stanford Graduate School of Education, where he chairs the International and Comparative Education program.  In short, he is one of America's best experts on international comparisons.

Richard Rothstein is a senior fellow of the Chief Justice Earl Warren Institute on Law and Social Policy at the University of California (Berkeley) School of Law.  He is the author of numerous books on education and served as the principal education writer for the New York Times.  His research and writing have often focused on how economic inequality affects education.

Clearly, anything these two gentlemen have to offer about education is worth paying attention to...

There are two major sets of international tests.  TIMSS, which now stands for Trends in International Mathematics and Science Study, is given to 4th and 8th grade students on a 4-year cycle that began in 1995; it is intended to track trends in mathematics and science achievement.  It is now given in over 60 nations, although not all nations have participated in every administration.

The people who produce TIMSS also produce PIRLS (Progress in International
Reading Literacy Study), which has been given on a 5-year cycle.  Both TIMSS and PIRLS were given in 2011.  PIRLS is NOT part of Carnoy and Rothstein's analysis.
PISA, which stands for Programme for International Student Assessment, is offered by the OECD, the Organisation for Economic Co-operation and Development.  First given in 2000, PISA is administered to 15-year-olds on a 3-year cycle, assessing their competencies in reading, mathematics, and science.

In each case, the tests are given to a sample of students in a participating nation.

We regularly hear people bemoaning the ranking of the US on such exams.  In the past, various people have pointed out a number of problems with placing too much weight on such rankings.  For example, in some past administrations of TIMSS, some nations exempted non-native speakers of the primary language.  Others eliminated questions on coastal biology on the grounds that their nation lacked a coastline (an exemption the US did not make for students in the middle of our vast interior).

Often in recent years we have been held up against the countries that did best, most notably Finland, and more recently also South Korea, Singapore, Shanghai, and Canada.  Singapore is essentially a city-state, and Shanghai is very much unlike the rest of China.

Carnoy and Rothstein chose to match US performance against that of Finland, South Korea, and Canada as high-performing countries, and against three large post-industrial countries with a number of economic similarities to the US:  the UK, France, and Germany.
To provide further depth to their analysis of US performance, the authors also examined data from various administrations of the National Assessment of Educational Progress (NAEP), often described by educational researchers as the nation's report card.  Like TIMSS and PISA, it uses a random sample of students from across the nation.

The last piece of background information I want to provide is this:  the data sets for the tests are eventually made available to researchers, which allows for the kind of analysis done by Carnoy and Rothstein.  Unfortunately most news organizations and politicians never get beyond the ordinal rankings - where the US ranks compared to other nations.  One danger in focusing on rankings is that there may be little meaningful difference between, say, being in 3rd place and being in 11th place.  Think of it like this:  what is the difference between a .287 hitter and a .278 hitter?
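That point about rankings can be made concrete with a minimal sketch.  The scores below are entirely hypothetical (they are not from PISA or TIMSS); the point is only that ordinal rank can exaggerate differences that are tiny on the underlying score scale:

```python
# Hypothetical national average scores (NOT real test data).
scores = {
    "Country A": 529, "Country B": 527, "Country C": 526,
    "Country D": 525, "Country E": 524, "Country F": 523,
    "Country G": 522, "Country H": 521, "Country I": 500,
}

# Rank countries from highest to lowest average score.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (name, score) in enumerate(ranked, start=1):
    print(rank, name, score)

# Country A is 1st and Country H is 8th, yet they differ by only
# 8 points on a roughly 500-point scale.
spread_1_to_8 = ranked[0][1] - ranked[7][1]
print(spread_1_to_8)  # → 8
```

A headline reading "Country H ranks 8th" sounds far worse than "Country H is 8 points behind the leader," even though both describe the same numbers - which is exactly the .287-versus-.278 problem.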

Historian Diane Ravitch likes to remind us that the US has always done poorly on such international comparisons.  Others - including me - will point out that the US has a significantly higher percentage of children in poverty (in our case measured by Free and Reduced Lunch statistics) than just about any other country in the comparisons (only Mexico has more) - for example, Finland has a childhood poverty rate of less than 5% while ours is more than 20%.  Those making such comparisons like to point out that our schools with less than 10% of students in poverty tend to perform about as well as high-scoring nations like Finland, even though 10% is twice Finland's overall childhood poverty rate.

This actually points to something Richard Rothstein reminded me of - that the issue is not merely the percentage of students in poverty.  Students with low incomes are not necessarily culturally deprived, which is why other indicators are also important.  If one is looking at the cultural deficits with which a child arrives at school, one might want to look at the educational attainment of the mothers and/or the number of books in the home.  If I might, imagine a child of two parents, at least one of whom is a graduate student.  The income of that family might be quite low, yet the mother will probably be at least a college graduate, and the family is likely to have significant educational assets (such as books) in the home.

Of greater importance, it is likely that child will be attending a school not heavily populated by children from poor and/or educationally deprived backgrounds.  There is a large body of evidence that says poor or culturally deprived children who attend schools where they are in the minority tend to outperform most students in schools where poor or culturally deprived students are a distinct, even heavy, majority.

So let me offer some of the key takeaways from the report.  If you do not want to wade through all the tables and detailed discussions (something educational policy wonks like me enjoy), you can get a very good overview from the executive summary, which runs for about 5 of the report's 100 pages (the very detailed 41 endnotes begin on p. 92 and run to p. 97, where the references begin, continuing to p. 99).

The authors conclude, after thoroughly examining all the data, that
In general, we find that test data are too complex and oversimplified to permit meaningful policy conclusions regarding U. S. educational performance without deeper study of test results and methodology. (p.2)
 Of course, that does not stop policymakers and those who have an agenda of demeaning U. S. public schools.

I am going to quote the two sets of findings the authors place in bold italics, a formatting I will also follow.  I will then offer some of the detail they provide to support these findings.

Because social class inequality is greater in the United States than in any of the countries with which we can reasonably be compared, the relative performance of U. S. adolescents is better than it appears when countries' national average performance is conventionally compared. (p.2)
To start with, we have more children in poverty.  Exacerbating the results, a sampling error in the most recent administration of PISA
resulted in the most-disadvantaged U. S. students being over-represented in the overall U. S. test-taker sample.  This error further depressed the reported average U. S. test score. (p. 2)
To this let me add the following:  we have a greater proportion of our young people from families at the bottom of the social class distribution than any of the six nations to which the report compares us.  Carnoy and Rothstein define social class by characteristics other than income, for the reasons noted previously.

However, one needs to also consider this:
At all points in the social class distribution, U. S. students perform worse, and in many cases, substantially worse, than students in a group of top-scoring countries (Canada, Finland, and Korea).  Although controlling for social class distribution would narrow the difference in average scores between these countries and the United States, it would not eliminate it. (p. 3)
Because not only educational effectiveness but also countries' social class composition changes over time, comparisons of test score trends over time by social class group provide more useful information to policymakers than comparisons of total average test scores at one point in time or even changes in total average test scores over time.  (p. 3)

The authors then note that the performance of our lowest social class students has been improving, while that of similar students in both groups of comparison countries, top-scoring and similarly post-industrial, has been falling.  As for our middle-class and advantaged students:  while their scores have not been improving, in some countries in both groups performance has actually declined.

There are also methodological issues, flowing both from how samples were obtained for the international tests and from how well the assessment items match what is actually being taught, that affect the results.

 For example, the distribution of students from schools with higher percentages of students eligible for Free and Reduced Lunch assistance on the 2009 PISA may well have distorted the results.  There is a detailed discussion of this in the material around Table 25, which appears on p. 70.  The authors note that while the percentage of program-eligible students who took the test was close to their share of the population of American schools - 35% compared to 36% - fully 40% of the students taking that administration of PISA came from high-poverty schools (50% or more of students eligible), while only 23% of US students attend such schools.  If one looks at schools with 75% eligibility, 16% of the sample came from such schools, while only 6% of US students attend them.  Since we know that students of similar economic background perform better in schools with a lower percentage of students from distressed economic circumstances, the authors argue there is at least a strong suggestion that the overall US score is distorted.
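The mechanics of that distortion can be sketched with a toy weighted average.  The stratum shares below come from the figures quoted above (the middle stratum is the remainder); the mean scores per stratum are hypothetical, chosen only to reflect the general pattern that students in higher-poverty schools score lower:

```python
# School strata by share of students eligible for Free and Reduced Lunch.
strata = ["<50% FRL", "50-74% FRL", ">=75% FRL"]

# Share of 2009 PISA test takers from each stratum vs. the share of all
# US students actually enrolled in such schools (from the discussion of
# Table 25; middle stratum computed as the remainder).
sample_share = {"<50% FRL": 0.60, "50-74% FRL": 0.24, ">=75% FRL": 0.16}
true_share   = {"<50% FRL": 0.77, "50-74% FRL": 0.17, ">=75% FRL": 0.06}

# Hypothetical mean scores per stratum (illustrative only).
mean_score = {"<50% FRL": 510, "50-74% FRL": 470, ">=75% FRL": 440}

# Average as the skewed sample reports it vs. reweighted to the true
# distribution of school poverty.
as_sampled = sum(sample_share[s] * mean_score[s] for s in strata)
reweighted = sum(true_share[s] * mean_score[s] for s in strata)

print(round(as_sampled, 1))  # → 489.2
print(round(reweighted, 1))  # → 499.0
```

With these made-up stratum scores, over-sampling high-poverty schools knocks roughly ten points off the estimated national average without any student performing differently - which is the shape of the distortion the authors describe.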

Second, the distribution of students in poverty is not necessarily the same for those being tested in math and those being tested in reading on the same administration of the test.  The authors point to a specific inconsistency for Finland.

Third, the population samples are not stable from administration to administration.  This MAY be an artifact of changing school populations during that time, but it may also represent sampling error, which can explain some of the variation in scores over time.

Fourth, the content sampled for mathematics varies from test to test, which may account for some of the variance in results between different administrations of the same test.

Fifth, the math being sampled may or may not match the curricular aims of a particular nation.  A greater emphasis on geometry questions advantages a nation that emphasizes geometry by the age students are tested, while disadvantaging a nation with a greater emphasis on things like algebra and statistics.  US math curricula tend to emphasize the latter, while the tests have increasingly sampled geometry.  Changes in the percentages of the various mathematical domains being sampled can affect a nation's performance over time without accurately representing any change in students' actual mathematical performance.
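This effect, too, is just arithmetic, and a small sketch makes it visible.  The domain scores and item weights below are entirely hypothetical; the point is that shifting the test's mix of items toward geometry moves the composite score even though performance in every domain is unchanged:

```python
# Hypothetical per-domain averages for one country (NOT real data).
# This country is relatively strong in algebra and statistics,
# relatively weak in geometry.
domain_scores = {"algebra": 520, "statistics": 525, "geometry": 480}

# Two hypothetical item mixes for the same test: an older one that is
# light on geometry, and a newer one that samples geometry heavily.
weights_old = {"algebra": 0.40, "statistics": 0.35, "geometry": 0.25}
weights_new = {"algebra": 0.30, "statistics": 0.25, "geometry": 0.45}

composite_old = sum(weights_old[d] * domain_scores[d] for d in domain_scores)
composite_new = sum(weights_new[d] * domain_scores[d] for d in domain_scores)

print(round(composite_old, 2))  # → 511.75
print(round(composite_new, 2))  # → 503.25
```

The composite drops about eight points between administrations purely because the item mix changed - a "decline" that says nothing about what the students actually learned.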

Sixth, changing math items toward a greater emphasis on problem-solving, as PISA has done, makes the math assessments effectively a reading comprehension test as well as a mathematics test.  Given the impact that parental literacy has on reading comprehension, the results may be more accurate measures of mathematical performance for higher- than for lower-class students.  It may also advantage ON MATH those nations with more effective literacy programs.

Seventh, testing at a particular age may represent different years of educational background.  Nations begin formal schooling at different ages, and this is further complicated by the availability of early childhood education and what it entails.  Here it is worth noting that the U. S. continues to lag in providing universal access to early childhood education, which - along with a higher degree of poverty and a larger share of the population with indicators such as fewer books in the home and/or mothers with low educational attainment - may also affect the results we are seeing.

There is lots more.

The authors explain the methodologies they use to analyze the data.   They raise cautions about their own analysis where appropriate.

Secretary of Education Duncan issued a now-famous statement after the release of the 2011 TIMSS scores, calling the results "unacceptable" and making his argument by combining those results with those of the 2009 PISA, which he claimed justified the policies he had been pursuing - policies that, I remind readers, included Race to the Top and the Blueprint for Education, and now include the requirements he imposed on states seeking waivers from No Child Left Behind.  Carnoy and Rothstein, in one of the most important paragraphs of this document, write
This conclusion, however, is oversimplified, exaggerated, and misleading.  It ignores the complexity of the content of test results and may well be leading policymakers to pursue inappropriate and even harmful reforms that change aspects of the U. S. education system that may be working well and neglect aspects that may be working poorly.  (p. 6)
 For example, low scores on a test that emphasizes geometry over statistics and probability might lead to changing the emphasis, but as the authors note
It certainly might not be good public policy to reduce curricular emphasis on statistics and probability, skills essential to an educated citizenry in a democracy, in order to make more time available for geometry. (p.8)
By now you should have a sense of the thoroughness of this report.  It may not be the kind of reading the average person would select at bedtime (unless one expected to become drowsy while wading through the detailed tables and explanations).  It is nevertheless essential reading for those who want to draw APPROPRIATE conclusions from the international comparisons.

As a professional educator whose life's work is severely impacted by the decisions of policymakers, and as one who engages in a discourse over education policy that is often shaped by journalists, I fervently hope that both policymakers and journalists will take the time to educate themselves about the real information in such tests, so that we do not continue to pursue policies based on false understandings - policies likely to be counterproductive to real learning without necessarily achieving the illusory goal of higher relative performance on international comparisons.

As the authors note on page 5
To make judgments only on the basis of statistically significant differences in national average scores, on only one test, at only one point in time, without regard to social class context or curricular or population sampling methodologies, is the worst possible choice.  But unfortunately, this is how most policymakers and analysts approach the field.
The field is educational policy.

It shapes the future of our nation.

It involves hundreds of billions of taxpayer dollars.

It shapes the development of our most precious resource, our young people.

Might it not make sense to understand what the data really mean before we commit major resources to attempting to change that data?

One would think.

I for one am exceedingly grateful for the magnificent effort Carnoy and Rothstein have made to help us with that understanding.

If you truly want to understand what the data from PISA and TIMSS mean, this report is essential reading.

If you want a broad overview but do not want to drill down, at least read the Executive Summary. 


Anonymous said...

One can't help wondering if similar analysis of the national assessments we administer in the states would reveal similar shortcomings. Does imposing a common curriculum and assessment instrument mean that kids in Detroit and kids in Des Moines can be evenly compared? Could it be that a state's investment in its education system could impact student learning? Could the culture of an individual school within a single state influence how those kids perform? And on and on.

Again, the assessment tail is wagging the school dog when it has so many shortcomings and misunderstandings that, if it were an instructional strategy in primary reading, we probably wouldn't even be using it. What does a homogenized national average score on one test on one day really tell us?

Today's average national gas price was $3.301 for a gallon of regular. Yesterday it was $3.293. LA is $3.746 today and $3.736 yesterday. Lexington is $3.239 today and $3.251 yesterday. On the way in today I bought it for $3.189. So what does all of that mean? LA has the most expensive gas and went up a penny when the national increase was less than a penny, and Lexington dropped almost 2 cents, but I got my gas a nickel cheaper than the Lexington average for today. My point is the product is basically the same, but each environment in which it exists presents different influences (transportation costs, taxes, COLA, franchise expenses, consumer demand, competition, environmental events/conditions, etc.). These factors bear constantly changing influences on price, and most of these influences cannot be controlled by one single entity (oil company, distributor, government, gas station owner, etc.). You can't really compare the number either to different states or to some homogenized national number - it's all relative to the environment in which the product exists.

Obviously, kids' scores are not gas prices, but I think there are some general parallels here, as supported by this article. I am not so sure that identifying that magical average level of performance or even ranking scores tells an individual kid or teacher much of anything; instead it is some sort of justification for those NOT in the classroom to empower themselves and impose their will on that to which they do not contribute and which they do not understand.

Richard Innes said...


My understanding is that the Carnoy/Rothstein report uses the number of books in the home as its sole way to assess and control for student poverty around the world. Without adding other measures, that is awfully thin.

Furthermore, do you think the book metric is a consistent measure of poverty from country to country? I honestly don’t know.

That said, I think there is some merit in trying to compensate for a situation that winds up comparing lots of poor Hispanic, non-English as a first language immigrants and other minorities in US schools to largely homogeneous, native-born, homeland language speaking student populations in other countries. The big problem is how to fairly do that.

BTW, the same issue applies when you try to compare Kentucky’s education performance to performance in other states. I'm not much impressed by some of the "good news" reports you like to jump all over (think Quality Counts among others) that ignore the fact that Kentucky's schools remain highly white while those lower-scoring poor, Hispanic non-English as first language immigrants and lots of other lower-scoring minorities now predominate in other states.

If you are going to pay attention to the effort in the Carnoy/Rothstein report, you need to do the same consistently in other reporting, as well.

Richard Day said...

I suspect you're both correct to some extent. My fear is that the bounds of validity and reliability have been stretched beyond their capacities to produce trustworthy information.

But my doubts won't keep the next hundred reports from hitting the media before critical examination occurs. It's how the game is being played these days.