Sunday, May 13, 2012

Robo-graders like ETS’s E-Rater aren’t good enough yet.

Standardized tests will finally ask good essay questions. 

But robot grading threatens that progress.

This from Dana Goldstein in Slate:
In 2002, Indiana rolled out computer scoring of its 11th grade state writing exam. At the time, ETS, the company that developed Indiana’s software, said automatic writing assessment could help cut the state’s testing budget in half. But by 2007, Indiana had abandoned the practice.

Why? Though ETS’s E-Rater proved adept at scoring so-called “naked” essays based only on personal opinion, it couldn’t reliably handle questions that required students to demonstrate knowledge from the curriculum. State testing officials tried making lists of keywords the software could scan for: in history, for example, “Queen Isabella,” “Columbus,” and “1492.” But the program didn’t understand the relationship between those items, and so would have given full credit to a sentence like, “Queen Isabella sailed 1,492 ships to Columbus, Ohio.” Cost and time savings never materialized, because most tests also had to be looked at by human graders.

Indiana’s experience is worth keeping in mind, since although the technology has not advanced dramatically over the past decade, we’re now in the midst of a new whirlwind of enthusiasm about electronic writing assessment. Last month, after a study from Mark Shermis of the University of Akron announced that computer programs and people award student-writing samples similar grades, an NPR headline teased, “Can a Computer Program Grade Essays As Well As a Human? Maybe Even Better, Study Says.” Education technology entrepreneur Tom Vander Ark, who co-directed the Shermis study, hailed the results as proof that robo-grading is “fast, accurate, and cost-effective.”

He is right about “fast”: E-Rater can reportedly grade 16,000 essays in 20 seconds. But “accurate” and “cost-effective” are debatable, especially if we want students to write not only about what they think and feel, but also about what they know. Testing companies acknowledge it is easy to game the current generation of robo-graders: Such software rewards longer word counts, unusual vocabulary, transition words such as “however” and “therefore,” and grammatical sentences—whether or not the facts contained within the sentences are correct. To address these problems, the Hewlett Foundation, which also paid for the Shermis study, is offering a $100,000 prize to the team of computer programmers that can make the biggest strides in improving the technology.

The recent push for automated essay scoring comes just as we’re on the verge of making standardized essay tests much more sophisticated in ways robo-graders will have difficulty dealing with. One of the major goals of the new Common Core curriculum standards, which 45 states have agreed to adopt, is to supplant the soft-focus “personal essay” writing that currently predominates in American classrooms with more evidence-driven, subject-specific writing. The creators of the Common Core hope machines can soon score these essays cheaply and quickly, saving states money in a time of harsh education budget cuts. But since robo-graders can’t broadly distinguish fact from fiction, adopting such software prematurely could be antithetical to testing students in more challenging essay-writing genres...
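
To see why the keyword lists described in the article fall short, here is a minimal sketch of that kind of scoring. It is not E-Rater's actual code, which the article does not show. The function names, the keyword list, and the choice to strip punctuation before matching are all assumptions made for illustration.

```python
import re

# Hypothetical keyword list, taken from the history example in the article.
KEYWORDS = ["queen isabella", "columbus", "1492"]


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters, digits, and spaces."""
    # Assumption: punctuation is stripped, so "1,492" becomes "1492".
    return re.sub(r"[^a-z0-9 ]", "", text.lower())


def keyword_score(essay: str, keywords=KEYWORDS) -> float:
    """Return the fraction of keywords that appear anywhere in the essay."""
    cleaned = normalize(essay)
    hits = sum(1 for kw in keywords if kw in cleaned)
    return hits / len(keywords)


nonsense = "Queen Isabella sailed 1,492 ships to Columbus, Ohio."
accurate = "In 1492, Queen Isabella financed the voyage of Columbus."

# Both sentences contain every keyword, so the scorer cannot tell them apart.
print(keyword_score(nonsense))
print(keyword_score(accurate))
```

Under these assumptions both sentences earn the same full score, because the scorer only checks whether each keyword is present and never looks at how the keywords relate to one another.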

1 comment:

Anonymous said...

If we are using computers to evaluate human writing, then why are we even learning how to write? Why not just have the computer do it for you? Just speak to the computer and it can translate it into a well-written sentence.


Uneducated, illiterate non-writer:
"It was really bitch'en but it sucked too."

Computer translation:
"It was the best of times, it was the worst of times."

I just don't understand why we expect human teachers to instruct students but have computer programs evaluate whether students learned and instruction occurred. Maybe I am a member of a dying profession?!?!? Soon to be replaced by some automated, computerized machine.