Saturday, October 04, 2008

Testing Expert Sees ‘Illusions of Progress’ Under NCLB

This from Ed Week:

...Harvard University researcher Daniel M. Koretz has a new book, Measuring Up: What Educational Testing Really Tells Us. Koretz contends that NCLB has prompted widespread teaching to the test and gaming of the high-stakes testing system, producing scores on state standardized tests that are substantially better than students’ actual mastery of the material.


I have little doubt that Koretz is correct that high-stakes testing prompted some folks to try to game the system. But to illustrate his point, he uses Kentucky's KIRIS assessment, which was discontinued by 1998, three years before NCLB. This is a curious choice if one hopes to illustrate a current problem with state systems and NCLB.
Mr. Koretz pointed to research in the 1990s on the state standardized test then used in Kentucky, which was designed to measure aspects of proficiency similar to those measured by the National Assessment of Educational Progress, the federally sponsored testing program often called “the nation’s report card.”

Scores on both tests should have moved more or less in lock step, he said. But instead, 4th grade reading scores rose sharply on the Kentucky Instructional Results Information System test, which resembled an NCLB-test prototype, from 1992 to 1994, while sliding slightly on NAEP over the same period.

I'm not sure how Koretz determined that KIRIS should have been in lock step with NAEP. As I recall, the biggest problem with KIRIS was that the damn thing wouldn't hold still - that, and the fact that Kentucky did not even have an underlying curriculum at the time. As Commissioner Thomas Boysen frequently said, we were building the airplane as we flew it. KIRIS was Exhibit A. It changed every year.


Was the NAEP similarly unstable? I don't recall that it was.

3 comments:

Anonymous said...

Here’s some enlightenment on the blog’s remarkably uninformed comments RE: Harvard Graduate School of Education Professor Daniel Koretz.

How did Koretz know that KIRIS should have been pretty much in “lock step” with the NAEP?

Koretz was one of six nationally recognized psychometricians chosen by the Kentucky Office of Education Accountability in 1995 to research KIRIS and render a report on same. That hugely detailed report is a classic. It points out many problems with KIRIS, far too many to list here (read the report – OEA still had hard copies last time I asked). Furthermore, Koretz did follow-up research on Kentucky’s testing of accommodated students in three later reports that stretch out into the current decade.

By the way, many of those early problems Koretz identified still remain today in “Son of KIRIS” – namely the CATS assessment – things like vague portfolio scoring and equally dubious scoring of open-response questions (performed by part-timers working for not-that-much above minimum wage).

Anyway, Koretz is a recognized national expert. He did a lot of research here, and he knows a great deal about the “warts” in KIRIS.

And Koretz is quite correct in claiming that KIRIS was sort of an NCLB prototype test. After all, CATS, the outgrowth of KIRIS, IS an NCLB test. The Principal might not like that fact because Koretz says the earlier KIRIS assessment was “bogus,” but history supports him.

Koretz probably knows some of the more recent history of CATS as “Son of KIRIS,” as well. After all, as EdWeek reports, Koretz is definitely looking nationwide at the growing discrepancies between “proficiency” as reported on state tests and the rates returned by the NAEP. Kentucky’s growing discrepancies, especially in fourth grade reading (a hot topic with Koretz, obviously) are too large for a competent researcher to miss. You can read more on that by looking here:

http://www.bipps.org/pubs/2007/CATSinDecline.pdf

Richard Day said...

Ouch, Baby. Ouch.

Hero worship aside, Mr. Innes should recall that, like him, I too testified against KIRIS. My complaint was that it was useless to principals for measuring change because it changed every year - not that it was imperfect.

Everybody knew it was imperfect, including Koretz and the technical panel that was trying to build the thing. All social science tests are imperfect.

But that isn't the point I made.

What I don't understand is how KIRIS is supposed to have been matched in any particular way to the NAEP.

But if the suggestion is that Koretz leads us to greater enlightenment, then BGI should think twice.

Koretz says it is inappropriate for any of us to use tests to compare schools - something BGI does religiously.

BGI, KSN&C and Koretz all like the idea of a value-added assessment but Koretz warns that it is inappropriate for judging teachers. "Value-added methods are by no means a silver bullet...For example, estimates of growth in individual classrooms in a single year are generally very imprecise, which is to say that they bounce around a good bit because of irrelevant factors. The result is that in any given year, many teachers will be misclassified," Koretz recently told Teacher magazine.

And for those who would remove teacher judgment from the assessment process, Koretz says, "Our current policies assume that test scores are sufficient and that there is no need for any human judgment in the evaluation of schools. I think that is a serious mistake..."

Anonymous said...

The Principal writes, “What I don't understand is how KIRIS is supposed to have been matched in any particular way to the NAEP.”

This isn’t hard to explain.

KIRIS was built on NAEP Frameworks and there was a considerable amount of crosstalk between NCES and the KDE when KIRIS was being set up. KERA explicitly required this, and the KDE went to considerable lengths to comply. We even got access to some NAEP questions to embed in KIRIS.

Same frameworks, same formats, even similar questions and an obvious indication in the law that results should track. They still didn’t, as Koretz points out.

It’s been a while since I looked, but I think some of this information is in the OEA Panel Report I mentioned in my earlier comments to this blog item.

RE: The Principal’s comments about judging teachers with value-added assessments: I agree that one year of test data isn’t adequate. I agree that there are “weak” and “strong” classes. However, this averages out over time. Tennessee uses three-year averages to develop its teacher evaluations, a year more than we use with CATS to condemn or praise an entire school, and that seems about right to me, absent better research.
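To see why a single year of value-added data is so noisy and why averaging over years helps, here is a minimal back-of-the-envelope sketch in Python. Every number in it is a made-up assumption for illustration (not KIRIS, CATS, or Tennessee data): it simulates a teacher of exactly average effectiveness whose yearly estimate bounces around because of “weak” and “strong” classes, and compares how often one-year estimates versus three-year averages would mislabel that teacher.

import random

random.seed(42)

TRUE_EFFECT = 0.0      # assume a teacher of exactly average effectiveness
CLASS_NOISE_SD = 0.15  # assumed year-to-year noise from "weak"/"strong" classes
YEARS = 999            # simulated years, grouped into threes

single_year = [random.gauss(TRUE_EFFECT, CLASS_NOISE_SD) for _ in range(YEARS)]
three_year = [sum(single_year[i:i + 3]) / 3 for i in range(0, YEARS, 3)]

def mislabeled(estimates, threshold=0.1):
    # Share of estimates that wrongly label an average teacher as
    # clearly above or below average (|estimate| > threshold).
    return sum(abs(e) > threshold for e in estimates) / len(estimates)

print(f"Mislabeled with 1-year estimates: {mislabeled(single_year):.0%}")
print(f"Mislabeled with 3-year averages:  {mislabeled(three_year):.0%}")

With these assumed numbers, the one-year estimate mislabels the average teacher roughly twice as often as the three-year average does, which is the statistical point behind averaging over multiple years before judging a teacher or a school.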