Bob Linn is often described as the dean of American testing
experts. Howard Everson, long the Vice-President for Research at the
College Board, is one of the nation's top psychometricians. Marc Tucker asked the
two of them to draw on their nearly half century of experience in this
field to reflect on the current state of testing in the light of the
recent emphasis on student performance standards and the growing
strength of the accountability movement.
Marc Tucker:
Over the last 15 years, each of you in succession has chaired the New
York State Technical Advisory Committee, responsible for advising the
state on testing issues. New York State has, since the Civil War, been
home to the Regents Exams, an essay-based examination system originally
designed for courses taken by elite students. More recently, New York
State has also been home to a more conventional multiple-choice system
of accountability testing. Both have been evolving. Have they been
sitting comfortably side by side?
Howard Everson: The
TAC had been raising questions about the reliability of the scores from
the Regents for a long time, but New York State officials did not
really focus on those issues until the accountability movement placed
much higher stakes on the results. These issues were exacerbated in the
case of the Regents by the fact that the teachers, not the state, had
control of the content and scoring of the exams. At the same time, the
Regents were concerned that the multiple choice tests were not rigorous
enough, not measuring the higher cognitive demands associated with
college readiness or the Common Core State Standards.
Bob Linn:
It has always been my view that there are just some things you can't do
with multiple-choice test items. Having kids construct their own
answers has important advantages. But there can be reliability and
comparability issues when teachers score them. I should note, though,
that the College Board, to its credit, has worked hard, and I think
successfully, to increase reliability in its Advanced Placement program
by training the scorers properly. That shows that it is possible to
have a reliable course-based system of assessment with human beings
scoring the exams.
MT: So how do the two of you
think about reliability and validity now, in the light of this history
and the current demands on testing?
HE: I would add
to the tradeoffs among validity, reliability and cost, the stakes that
get attached to the tests. When Bob and I were on the TAC in the early
1990s, the stakes were shifting and changing and they are continuing to
do so. As the stakes rose, we were pushed to emphasize reliability
more, validity less.
MT: Do these high stakes push you towards multiple-choice?
HE:
Yes. It is not just the technology of the testing itself that changes
over time; the most important change has been in the way tests are used
here in the United States.
MT: If we look
globally at the kinds of assessments used by top-performing countries,
we see testing systems based largely on curriculum that are similar in
design to the Regents and the College Board Advanced Placement
tests—heavily essay-based with some multiple-choice items, but mostly
essay. Those countries are doing very well. In their systems, there is
not much test-based accountability for teachers but there is a great
deal of test-based accountability for students. Their exams probably
don't meet some of the psychometric standards for reliability we have in
the United States. Where does that leave the two of you on where we
should be going as a country?
HE: There has been a
lot of litigation on testing in the United States and I think it has
forced us to emphasize test score reliability over validity.
BL:
One big difference between the United States and other countries is the
prestige of teachers and the trust placed in them, which are very low in this
country and tend to be quite high in the top performers. This has led to the
development of accountability systems that use external measures to see
if schools and individual teachers are doing a good job. This has
morphed into the next level: evaluating individual teachers. Unless we
can find a way to increase the prestige of teachers and public
confidence in them, it will be hard to move too far away from using
testing for these purposes.
MT: If we were to
rely less on tests for reliability purposes, do you think we would be
able to develop tests that will do a better job of measuring the kind of
higher order thinking skills that lie at the heart of the Common Core
State Standards?
BL: Yes. I think that is true.
MT:
What is your advice on the right balance between validity and
reliability, especially if we want to embrace the goals implicit in the
Common Core?
HE: I think the importance of reliability has been overblown.
BL:
I agree it is less important than comparability and validity and
fairness. It would be highly desirable to go where the two state
testing consortia want to go. They want to include, in addition to
multiple-choice items, items where kids are required to do things, solve
problems and show how they come up with solutions to the problems they
are given. But the realities of timing and cost are pushing them in a
direction that will likely force them to come up short.
MT:
Is this country getting ready to make a profound mistake? We use
grade-by-grade testing in grades 3-8 but no other country is doing it
this way for accountability; instead they test 2 or 3 times in a
student's career. If the United States did it that way, we could afford
some of the best tests in the world without spending any more money.
BL:
Raising the stakes for our test-based accountability systems so that
there will be consequences for individual teachers will make matters
even worse. Cheating scandals will blossom. I think this annual
testing is unnecessary and is a big part of the problem. What we should
be doing is testing at two key points along the way in grades K-8, and
then in high school using end-of-course tests.
HE:
I am in the same place as Bob. The multiple-choice paradigm first used
in WWI and eventually used to satisfy the NCLB requirements has proven
to be quite brittle, especially when applied in every grade 3-8 and used
to make growth assumptions. The quick and widespread adoption of
multiple-choice testing was in hindsight a big mistake for this country,
but now states will tell you it is all they can afford.