Computers Weighing In On the Elements of Essays
Programs Critique Structure, Not Ideas

Lea Kleinschmidt says computer grading of her GMAT writing test allowed her to get quick feedback. (Susan Biddle -- The Washington Post)

By Jay Mathews
Washington Post Staff Writer
Sunday, August 1, 2004; Page A01

Nadeem Sarwar remembered all the lessons on good writing drummed into him by his ninth-grade teacher at the Newport School in Montgomery County. So last summer when he took the GMAT, the standard entrance exam for business graduate schools, he was careful to be clear and organized, just as Janice Marcus had taught him.

But Sarwar, 29, said he was disturbed to discover that one of the graders of his essays on that test would not have been the least bit impressed by Marcus's teaching. The grader wasn't even human.

Sarwar's score of five out of six possible points on the analytical writing section was awarded, in part, by a computer program named e-rater. His performance on the GMAT was strong enough to help him gain admission to the George Washington University School of Business. Still, he cringes at the thought of his words being judged by a machine housed in a brick building in Newark, Del.

"I would feel safer having English teachers with 30 years' experience reading my essay," he said. "They have the background and the perspective to understand what I am saying."

Many students who have agonized over getting a high-stakes essay just right express similar views. Yet few of them know that essay-grading computers are quietly making significant gains in the booming U.S. testing industry.

More than 2 million essays have been scored by e-rater since it was adopted for the GMAT in 1999, and the technology is being considered for use in the Graduate Record Examination, for graduate school admissions, and the Test of English as a Foreign Language, which assesses the English proficiency of foreign students entering U.S. schools.

Testing experts predict that machines eventually will help grade the SAT and the ACT, both of which will add writing sections to their college admissions tests in 2005, because computers cost less and work faster than humans. Before the technology entered the picture, teams of people graded each GMAT essay. Now one human grader's judgment is compared with the machine's conclusion.

"It is sort of inevitable," said Jeff Rubenstein, vice president for technology at the test-preparation company Princeton Review, "but it is also sort of regrettable." He said he knows test takers "who are brilliant writers, but they write very subtly," and when a machine is grading them, "they score terribly."

E-rater was developed in the 1990s using a "natural language processing technology" that identifies grammar, sentence structure and strength of vocabulary. The computer is also programmed to scan for the elements present in a well-supported essay, said Richard Swartz, an executive with Educational Testing Service, which created the program.

The human grader helps prevent test takers from fooling the computer with an essay that has an orderly structure but, from a logic standpoint, is nonsense, according to David Wilson, president and CEO of the nonprofit Graduate Management Admission Council, which gives the GMAT.

But critics are unconvinced. "When machines can provide a good summary of the Federalist Papers and a competent commentary on their style and cogency, I will then believe that they can replace competent human readers on important assignments," said Will Fitzhugh, founder of several groups, including the Massachusetts-based National Writing Board, that promote high school research and composition.

The developers of e-rater and other essay-scoring programs shrug off such barbs as old thinking by people who do not realize how accurate and dependable their systems are -- or how careful they are to compare results with human graders. When GMAT essays are graded by a human and by e-rater, Swartz said, a second person is brought in whenever the scores differ by more than one point on the six-point grading scale.

He said the first experimental e-rater programs gave more credit to essays that were longer and had longer sentences and longer words, but those flaws were eliminated.

Much of the research used to refine the technology involved taking essays that had been graded by people and looking for signs the computer could use to come close to those human judgments. The GMAT Web site boasts that "e-rater and independent readers agree, on average, 87 percent to 94 percent of the time."

Swartz emphasized the modest goal of computerized scoring: to judge the structure and coherence of the writing, rather than the quality of the thoughts and originality of the prose. In college, he said, professors grade the development of ideas, while essay-rating computers "are better suited to judgment about more basic-level writing."

Some students whose writing has been scrutinized by e-rater said they recognize this essential difference and don't fear the new technology.

Lea Kleinschmidt, who has a bachelor's degree in English literature from Florida State University, took the 3 1/2-hour GMAT at a testing center in the District last year. She had no complaints about her e-rater score -- six out of a possible six points on the one-hour analytical writing assessment.

The test givers were looking "more for your ability to communicate than what you actually said," Kleinschmidt noted, and the computer allowed her to get quick feedback in the process of winning admission to the Robert H. Smith School of Business at the University of Maryland.

Wilson said he believes the e-rater, on average, delivers more accurate results than a human grader because it's "not subject to the variabilities of mankind." But the principal reason the GMAT adopted e-rater was "we could save money on that and pass it on to the test takers," he said.

Essay-grading computer programs are being used by some online universities and companies to improve student and employee writing.

E-rater also is being used in the new Collegiate Learning Assessment, a three-hour test of analytical and critical thinking designed by the Rand Corp. to help colleges find out how much their students are learning. Computer grading systems also are being developed by Thomson Prometric, Vantage Laboratories and other companies.

Machines will not be grading the new SAT and ACT essay questions when they debut next year, but many experts say they will someday.

Even the College Board, provider of the SAT, does not rule it out. "It is not inconceivable that we would do it in the distant future," said Chiara Coletti, the College Board's vice president for communications and public affairs.

The Princeton Review and its test-preparation rival, Kaplan Inc., which is owned by The Washington Post Co., recommend that those taking the GMAT take a conservative approach to the machine-graded essays -- using topic sentences at the beginning of each paragraph and other methods they learned in ninth grade.

Only on personal essays for college applications do educators seem to be taking a hard line against computer scoring.

"From my 30 years of experience in college admissions, I don't think that this would be acceptable to the public," said Bruce Walker, vice provost and director of admissions at the University of Texas.

That is fine with Christine Bell, whose GMAT score was good enough for admission to the George Washington University School of Business, but who said she felt "kind of strange" when she learned that a computer grader had been involved.

"As computers advance, we are seeing amazing ability for them to be more human," she said. "But I don't think we have reached the point where we are completely compatible."

© 2004 The Washington Post Company