Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a method of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades—for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification.
Several factors have contributed to a growing interest in AES. Among them are cost, accountability, standards, and technology. Rising education costs have led to pressure to hold the educational system accountable for results by imposing standards. The advance of information technology promises to measure educational achievement at reduced cost.
The use of AES for high-stakes testing in education has generated significant backlash, with opponents pointing to research indicating that computers cannot yet grade writing accurately and arguing that their use for such purposes promotes teaching writing in reductive ways (i.e., teaching to the test).
Most historical summaries of AES trace the origins of the field to the work of Ellis Batten Page. In 1966, he argued for the possibility of scoring essays by computer, and in 1968 he published his successful work with a program called Project Essay Grade™ (PEG™). Using the technology of that time, computerized essay scoring would not have been cost-effective, so Page abated his efforts for about two decades.
By 1990, desktop computers had become so powerful and so widespread that AES was a practical possibility. As early as 1982, a UNIX program called Writer's Workbench was able to offer punctuation, spelling, and grammar advice. In collaboration with several companies (notably Educational Testing Service), Page updated PEG and ran some successful trials in the early 1990s.
Peter Foltz and Thomas Landauer developed a system using a scoring engine called the Intelligent Essay Assessor™ (IEA). IEA was first used to score essays in 1997 for their undergraduate courses. It is now a product from Pearson Educational Technologies and used for scoring within a number of commercial products and state and national exams.
IntelliMetric® is Vantage Learning's AES engine. Its development began in 1996. It was first used commercially to score essays in 1998.
Educational Testing Service offers e-rater®, an automated essay scoring program. It was first used commercially in February 1999. Jill Burstein was the team leader in its development. ETS's Criterion℠ Online Writing Evaluation Service uses the e-rater engine to provide both scores and targeted feedback.
Lawrence Rudner has done some work with Bayesian scoring, and developed a system called BETSY (Bayesian Essay Test Scoring sYstem). Some of his results have been published in print or online, but no commercial system incorporates BETSY as yet.
Under the leadership of Howard Mitzel and Sue Lottridge, Pacific Metrics developed a constructed response automated scoring engine, CRASE®. Currently utilized by several state departments of education and in a U.S. Department of Education-funded Enhanced Assessment Grant, Pacific Metrics’ technology has been used in large-scale formative and summative assessment environments since 2007.
Measurement Inc. acquired the rights to PEG in 2002 and has continued to develop it.
In 2012, the Hewlett Foundation sponsored a competition on Kaggle called the Automated Student Assessment Prize (ASAP). A total of 201 challenge participants attempted to predict, using AES, the scores that human raters would give to thousands of essays written to eight different prompts. The intent was to demonstrate that AES can be as reliable as human raters, or more so. The competition also hosted a separate demonstration among nine AES vendors on a subset of the ASAP data. Although the investigators reported that the automated essay scoring was as reliable as human scoring, this claim was not substantiated by any statistical tests, because some of the vendors required that no such tests be performed as a precondition for their participation. Moreover, the claim that the Hewlett study demonstrated that AES can be as reliable as human raters has since been strongly contested, including by Randy E. Bennett, the Norman O. Frederiksen Chair in Assessment Innovation at the Educational Testing Service. Some of the major criticisms of the study are these: first, five of the eight datasets consisted of paragraphs rather than essays; second, four of the eight datasets were graded by human readers for content only rather than for writing ability; and third, rather than measuring human readers and the AES machines against the "true score" (the average of the two readers' scores), the study employed an artificial construct, the "resolved score", which in four datasets consisted of the higher of the two human scores whenever the readers disagreed. This last practice, in particular, gave the machines an unfair advantage by allowing them to round up for these datasets.
From the beginning, the basic procedure for AES has been to start with a training set of essays that have been carefully hand-scored. The program evaluates surface features of the text of each essay, such as the total number of words, the number of subordinate clauses, or the ratio of uppercase to lowercase letters—quantities that can be measured without any human insight. It then constructs a mathematical model that relates these quantities to the scores that the essays received. The same model is then applied to calculate scores of new essays.
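The feature-measurement step can be illustrated with a minimal Python sketch. The helper name `surface_features` and the choice of features are assumptions made for illustration: word count and the uppercase-to-lowercase ratio come from the description above, while average word length is an added example. Counting subordinate clauses would require a parser and is left out. This is not the feature set of any particular AES product.

```python
import re

def surface_features(text):
    """Return a few surface measurements of an essay (illustrative only)."""
    # Treat runs of letters and apostrophes as words; a hypothetical, crude tokenizer.
    words = re.findall(r"[A-Za-z']+", text)
    upper = sum(1 for ch in text if ch.isupper())
    lower = sum(1 for ch in text if ch.islower())
    return {
        "word_count": len(words),
        "upper_lower_ratio": upper / lower if lower else 0.0,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

features = surface_features("The cat sat. It was happy.")
print(features)
```

Each essay in the training set would be reduced to such a dictionary of numbers before any model is fitted.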
In 2015, Isaac Persing and Vincent Ng proposed one such mathematical model, which evaluates essays not only on the above features but also on their argument strength. It evaluates various features of the essay, such as the agreement level of the author and the reasons for it, adherence to the prompt's topic, locations of argument components (major claim, claim, premise), errors in the arguments, and cohesion in the arguments, among various other features. In contrast to the other models mentioned above, this model comes closer to duplicating human insight while grading essays.
The various AES programs differ in what specific surface features they measure, how many essays are required in the training set, and, most significantly, in the mathematical modeling technique. Early attempts used linear regression. Modern systems may use linear regression or other machine learning techniques, often in combination with other statistical techniques such as latent semantic analysis and Bayesian inference.
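The linear-regression approach can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: a single predictor (word count), invented training data, and hypothetical function names `fit_line` and `predict_score`. Real systems use many features and are not claimed to work exactly this way.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_score(word_count, slope, intercept, lo=1, hi=6):
    """Apply the fitted line, then round and clamp to the grading scale."""
    raw = slope * word_count + intercept
    return max(lo, min(hi, round(raw)))

# Invented training data: word counts and hand-assigned scores on a 1 to 6 scale.
train_word_counts = [100, 200, 300, 400, 500]
train_scores = [1, 2, 3, 4, 5]

slope, intercept = fit_line(train_word_counts, train_scores)
new_score = predict_score(360, slope, intercept)
print(slope, intercept, new_score)
```

The clamp to the 1 to 6 range is a design choice for this sketch, since a fitted line can produce values outside the grading scale.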
Criteria for success
Any method of assessment must be judged on validity, fairness, and reliability. An instrument is valid if it actually measures the trait that it purports to measure. It is fair if it does not, in effect, penalize or privilege any one class of people. It is reliable if its outcome is repeatable, even when irrelevant external factors are altered.
Before computers entered the picture, high-stakes essays were typically given scores by two trained human raters. If the scores differed by more than one point, a third, more experienced rater would settle the disagreement. In this system, there is an easy way to measure reliability: by inter-rater agreement. If raters do not consistently agree within one point, their training may be at fault. If a rater consistently disagrees with whichever other raters look at the same essays, that rater probably needs more training.
Various statistics have been proposed to measure inter-rater agreement. Among them are percent agreement, Scott's π, Cohen's κ, Krippendorff's α, Pearson's correlation coefficient r, Spearman's rank correlation coefficient ρ, and Lin's concordance correlation coefficient.
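As an illustration of one of these statistics, the following Python sketch computes the unweighted form of Cohen's κ, defined as (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of exact agreement and p_e is the agreement expected by chance from each rater's score distribution. The function name and the sample scores are invented for the example. Weighted variants exist and are not shown.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same essays."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions per score.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    # Note: undefined (division by zero) if both raters gave one identical score throughout.
    return (observed - expected) / (1 - expected)

kappa = cohens_kappa([1, 2, 3, 4], [1, 2, 3, 3])
print(round(kappa, 3))
```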
Percent agreement is a simple statistic applicable to grading scales with scores from 1 to n, where usually 4 ≤ n ≤ 6. It is reported as three figures, each a percent of the total number of essays scored: exact agreement (the two raters gave the essay the same score), adjacent agreement (the raters differed by at most one point; this includes exact agreement), and extreme disagreement (the raters differed by more than two points). Expert human graders were found to achieve exact agreement on 53% to 81% of all essays, and adjacent agreement on 97% to 100%.
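The three percent-agreement figures defined above can be computed as in the following Python sketch. The function name and the two lists of scores are invented for illustration.

```python
def percent_agreement(rater_a, rater_b):
    """Return exact, adjacent, and extreme-disagreement percentages."""
    n = len(rater_a)
    diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    return {
        "exact": 100 * sum(d == 0 for d in diffs) / n,     # same score
        "adjacent": 100 * sum(d <= 1 for d in diffs) / n,  # within one point, includes exact
        "extreme": 100 * sum(d > 2 for d in diffs) / n,    # more than two points apart
    }

# Invented scores from two raters on a 1 to 6 scale.
human_1 = [4, 3, 5, 2, 6, 1, 4, 3, 5, 4]
human_2 = [4, 4, 5, 2, 5, 4, 4, 1, 5, 4]

result = percent_agreement(human_1, human_2)
print(result)
```

Note that the three figures need not sum to 100%: adjacent agreement includes exact agreement, and a difference of exactly two points falls into none of the three categories.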
Inter-rater agreement can now be applied to measuring the computer's performance. A set of essays is given to two human raters and an AES program. If the computer-assigned scores agree with one of the human raters as well as the raters agree with each other, the AES program is considered reliable. Alternatively, each essay is given a "true score" by taking the average of the two human raters' scores, and the two humans and the computer are compared on the basis of their agreement with the true score.
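Both comparison methods can be sketched in Python. The scores are invented, and two choices here are assumptions rather than part of the description above: exact agreement is used as the agreement measure for the first method, and mean absolute difference from the true score is used for the second.

```python
def exact_rate(xs, ys):
    """Proportion of essays on which two score lists agree exactly."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

def mean_abs_diff(xs, ys):
    """Average absolute distance between two score lists."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Invented scores for four essays.
human_1 = [4, 3, 5, 2]
human_2 = [4, 4, 5, 3]
machine = [4, 3, 4, 3]

# Method 1: does the machine agree with a human as well as the humans agree with each other?
human_human = exact_rate(human_1, human_2)
machine_human = exact_rate(machine, human_1)
meets_first_criterion = machine_human >= human_human

# Method 2: compare every rater with the "true score" (average of the two humans).
true_scores = [(a + b) / 2 for a, b in zip(human_1, human_2)]
distance = {
    "human_1": mean_abs_diff(human_1, true_scores),
    "human_2": mean_abs_diff(human_2, true_scores),
    "machine": mean_abs_diff(machine, true_scores),
}
print(human_human, machine_human, meets_first_criterion, distance)
```

Because the true score is built from the two human scores, each human is by construction never far from it, which is worth keeping in mind when reading such comparisons.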
Some researchers have reported that their AES systems can, in fact, do better than a human. Page made this claim for PEG in 1994. Scott Elliot said in 2003 that IntelliMetric typically outperformed human scorers. AES machines, however, appear to be less reliable than human readers for any kind of complex writing test.
In current practice, high-stakes assessments such as the GMAT are always scored by at least one human. AES is used in place of a second rater. A human rater resolves any disagreements of more than one point.
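The disagreement rule described here can be written as a small Python check. The function name is hypothetical, and the sketch only flags essays for human resolution. How the final score is derived when the two scores are within one point is not stated above, so it is not modeled.

```python
def needs_human_resolution(human_score, machine_score, threshold=1):
    """True when the two scores differ by more than the allowed number of points."""
    return abs(human_score - machine_score) > threshold

# Invented (human, machine) score pairs.
pairs = [(4, 5), (4, 6), (3, 3), (6, 2)]
flagged = [pair for pair in pairs if needs_human_resolution(*pair)]
print(flagged)
```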
AES has been criticized on various grounds. Yang et al. mention "the overreliance on surface features of responses, the insensitivity to the content of responses and to creativity, and the vulnerability to new types of cheating and test-taking strategies." Several critics are concerned that students' motivation will be diminished if they know that no human will read their writing. Among the most telling critiques are reports of intentionally gibberish essays being given high scores.
On March 12, 2013, HumanReaders.Org launched an online petition, "Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment". Within weeks, the petition gained thousands of signatures, including that of Noam Chomsky, and was cited in a number of newspapers, including The New York Times, and on a number of education and technology blogs.
The petition describes the use of AES for high-stakes testing as "trivial", "reductive", "inaccurate", "undiagnostic", "unfair", and "secretive".
In a detailed summary of research on AES, the petition site notes, "RESEARCH FINDINGS SHOW THAT no one—students, parents, teachers, employers, administrators, legislators—can rely on machine scoring of essays ... AND THAT machine scoring does not measure, and therefore does not promote, authentic acts of writing."
The petition specifically addresses the use of AES for high-stakes testing and says nothing about other possible uses.
Most resources for automated essay scoring are proprietary.
- e-rater – published by ETS
- IntelliMetric – by Vantage Learning
- Project Essay Grade – by Measurement, Inc.
- Page, E.B. (2003). "Project Essay Grade: PEG", p. 43. In: Automated Essay Scoring: A Cross-Disciplinary Perspective. Shermis, Mark D., and Jill Burstein, eds. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
- Larkey, Leah S., and W. Bruce Croft (2003). "A Text Categorization Approach to Automated Essay Grading", p. 55. In: Automated Essay Scoring: A Cross-Disciplinary Perspective. Shermis, Mark D., and Jill Burstein, eds. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
- Keith, Timothy Z. (2003). "Validity of Automated Essay Scoring Systems", p. 153. In: Automated Essay Scoring: A Cross-Disciplinary Perspective. Shermis, Mark D., and Jill Burstein, eds. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
- Shermis, Mark D., Jill Burstein, and Claudia Leacock (2006). "Applications of Computers in Assessment and Analysis of Writing", p. 403. In: Handbook of Writing Research. MacArthur, Charles A., Steve Graham, and Jill Fitzgerald, eds. Guilford Press, New York, ISBN 1-59385-190-1
- Attali, Yigal, Brent Bridgeman, and Catherine Trapani (2010). "Performance of a Generic Approach in Automated Essay Scoring", p. 4. Journal of Technology, Learning, and Assessment, 10(3)
- Wang, Jinhao, and Michelle Stallone Brown (2007). "Automated Essay Scoring Versus Human Scoring: A Comparative Study", p. 6. Journal of Technology, Learning, and Assessment, 6(2)
- Bennett, Randy Elliot, and Anat Ben-Simon (2005). "Toward Theoretically Meaningful Automated Essay Scoring", p. 6. Archived October 7, 2007, at the Wayback Machine. Retrieved 2012-03-19.
- Page, E.B. (1966). "The imminence of grading essays by computers". Phi Delta Kappan, 47, 238-243.
- Page, E.B. (1968). "The Use of the Computer in Analyzing Student Essays". International Review of Education, 14(3), 253-263.
- Page, E.B. (2003), pp. 44-45.
- MacDonald, N.H., L.T. Frase, P.S. Gingrich, and S.A. Keenan (1982). "The Writer's Workbench: Computer Aids for Text Analysis". IEEE Transactions on Communications, 3(1), 105-110.
- Page, E.B. (1994). "New Computer Grading of Student Prose, Using Modern Concepts and Software". Journal of Experimental Education, 62(2), 127-142.
- Rudner, Lawrence. "Three prominent writing assessment programs". Archived March 9, 2012, at the Wayback Machine. Retrieved 2012-03-06.
- Elliot, Scott (2003). "IntelliMetric™: From Here to Validity", p. 75. In: Automated Essay Scoring: A Cross-Disciplinary Perspective. Shermis, Mark D., and Jill Burstein, eds. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
- "IntelliMetric®: How it Works". Retrieved 2012-02-28.
- Burstein, Jill (2003). "The E-rater® Scoring Engine: Automated Essay Scoring with Natural Language Processing", p. 113. In: Automated Essay Scoring: A Cross-Disciplinary Perspective. Shermis, Mark D., and Jill Burstein, eds. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
- Rudner, Lawrence (ca. 2002). "Computer Grading using Bayesian Networks-Overview". Archived March 8, 2012, at the Wayback Machine. Retrieved 2012-03-07.
- "Assessment Technologies", Measurement Incorporated. Retrieved 2012-03-09.
- "Hewlett prize". Retrieved 2012-03-05.
- University of Akron (12 April 2012). "Man and machine: Better writers, better grades". Retrieved 4 July 2015.
- Shermis, Mark D., and Jill Burstein, eds. Handbook of Automated Essay Evaluation: Current Applications and New Directions. Routledge, 2013.
- Rivard, Ry (March 15, 2013). "Humans Fight Over Robo-Readers". Inside Higher Ed. Retrieved 14 June 2015.
- Perelman, Les (August 2013). "Critique of Mark D. Shermis & Ben Hamner, 'Contrasting State-of-the-Art Automated Scoring of Essays: Analysis'". Journal of Writing Assessment. 6 (1). Retrieved June 13, 2015.
- Perelman, L. (2014). "When 'the state of the art' is counting words". Assessing Writing, 21, 104-111.
- Bennett, Randy E. (March 2015). "The Changing Nature of Educational Assessment". Review of Research in Education. 39 (1): 370–407.
- Keith, Timothy Z. (2003), p. 149.
- Persing, Isaac, and Vincent Ng (2015). "Modeling Argument Strength in Student Essays", pp. 543-552. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Retrieved 2015-10-22.
- Bennett, Randy Elliot, and Anat Ben-Simon (2005), p. 7.
- Chung, Gregory K.W.K., and Eva L. Baker (2003). "Issues in the Reliability and Validity of Automated Scoring of Constructed Responses", p. 23. In: Automated Essay Scoring: A Cross-Disciplinary Perspective. Shermis, Mark D., and Jill Burstein, eds. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
- Elliot, Scott (2003), p. 77.
- Burstein, Jill (2003), p. 114.
- Bennett, Randy E. (May 2006). "Technology and Writing Assessment: Lessons Learned from the US National Assessment of Educational Progress" (PDF). International Association for Educational Assessment. Retrieved 5 July 2015.
- McCurry, D. (2010). "Can machine scoring deal with broad and open writing tests as well as human readers?". Assessing Writing. 15: 118–129.
- R. Bridgeman (2013). Shermis, Mark D.; Burstein, Jill, eds. Handbook of Automated Essay Evaluation. New York: Routledge. pp. 221–232.
- Yang, Yongwei, Chad W. Buckendahl, Piotr J. Juszkiewicz, and Dennison S. Bhola (2002). "A Review of Strategies for Validating Computer-Automated Scoring". Applied Measurement in Education, 15(4). Archived January 13, 2016, at the Wayback Machine. Retrieved 2012-03-08.
- Wang, Jinhao, and Michelle Stallone Brown (2007), pp. 4-5.
- Dikli, Semire (2006). "An Overview of Automated Scoring of Essays". Journal of Technology, Learning, and Assessment, 5(1)
- Ben-Simon, Anat (2007). "Introduction to Automated Essay Scoring (AES)". PowerPoint presentation, Tbilisi, Georgia, September 2007.
- Winerip, Michael (22 April 2012). "Facing a Robo-Grader? Just Keep Obfuscating Mellifluously". The New York Times. Retrieved 5 April 2013.
- "Signatures >> Professionals Against Machine Scoring Of Student Essays In High-Stakes Assessment". HumanReaders.Org. Retrieved 5 April 2013.
- Markoff, John (4 April 2013). "Essay-Grading Software Offers Professors a Break". The New York Times. Retrieved 5 April 2013.
- Larson, Leslie (5 April 2013). "Outrage over software that automatically grades college essays to spare professors from having to assess students'". Daily Mail. Retrieved 5 April 2013.
- Garner, Richard (5 April 2013). "Professors angry over essays marked by computer". The Independent. Retrieved 5 April 2013.
- Corrigan, Paul T. (25 March 2013). "Petition Against Machine Scoring Essays, HumanReaders.Org". Teaching & Learning in Higher Ed. Retrieved 5 April 2013.
- Jaffee, Robert David (5 April 2013). "Computers Cannot Read, Write or Grade Papers". Huffington Post. Retrieved 5 April 2013.
- "Professionals Against Machine Scoring Of Student Essays In High-Stakes Assessment". HumanReaders.Org. Retrieved 5 April 2013.
- "Research Findings >> Professionals Against Machine Scoring Of Student Essays In High-Stakes Assessment". HumanReaders.Org. Retrieved 5 April 2013.
- "Works Cited >> Professionals Against Machine Scoring Of Student Essays In High-Stakes Assessment". HumanReaders.Org. Retrieved 5 April 2013.
- "Assessment Technologies." Measurement, Inc. http://www.measurementinc.com/products-services/automated-essay-scoring.
"Can we write during recess?" Some students were asking that question at Anna P. Mote Elementary School, where teachers were testing software that automatically evaluates essays for University of Delaware researcher Joshua Wilson.
Wilson, assistant professor in UD's School of Education in the College of Education and Human Development, asked teachers at Mote and Heritage Elementary School, both in Delaware's Red Clay Consolidated School District, to use the software during the 2014-15 school year and give him their reaction.
Wilson, whose doctorate is in special education, is studying how the use of such software might shape instruction and help struggling writers.
The software Wilson used is called PEGWriting (which stands for Project Essay Grade Writing), based on work by the late education researcher Ellis B. Page and sold by Measurement Incorporated, which supports Wilson's research with indirect funding to the University.
The software uses algorithms to measure more than 500 text-level variables to yield scores and feedback regarding the following characteristics of writing quality: idea development, organization, style, word choice, sentence structure, and writing conventions such as spelling and grammar.
The idea is to give teachers useful diagnostic information on each writer and give them more time to address problems and assist students with things no machine can comprehend – content, reasoning and, especially, the young writer at work.
Writing is recognized as a critical skill in business, education and many other layers of social engagement. Finding reliable, efficient ways to assess writing is of increasing interest nationally as standardized tests add writing components and move to computer-based formats.
The National Assessment of Educational Progress, also called the Nation's Report Card, first offered computer-based writing tests in 2011 for grades 8 and 12 with a plan to add grade 4 tests in 2017. That test uses trained readers for all scoring.
Other standardized tests also include writing components, such as the assessments developed by the Partnership for Assessment of Readiness for College and Careers (PARCC) and the Smarter Balanced Assessment, used for the first time in Delaware this year. Both PARCC and Smarter Balanced are computer-based tests that will use automated essay scoring in the coming years.
Researchers have established that computer models are highly predictive of how humans would have scored a given piece of writing, Wilson said, and efforts to increase that accuracy continue.
However, Wilson's research is the first to look at how the software might be used in conjunction with instruction and not as a standalone scoring/feedback machine.
In earlier research, Wilson and his collaborators showed that teachers using the automated system spent more time giving feedback on higher-level writing skills – ideas, organization, word choice.
Those who used standard feedback methods without automated scoring said they spent more time discussing spelling, punctuation, capitalization and grammar.
The benefits of automation are great, from an administrative point of view. If computer models provide acceptable evaluations and speedy feedback, they reduce the amount of needed training for human scorers and, of course, the time necessary to do the scoring.
Consider the thousands of standardized tests now available – state writing tests, SAT and ACT tests for college admission, GREs for graduate school applicants, LSATs for law school hopefuls and MCATs for those applying to medical school.
When scored by humans, essays are evaluated by groups of readers that might include retired teachers, journalists and others trained to apply specific rubrics (expectations) as they analyze writing.
Their scores are calibrated and analyzed for subjectivity and, in large-scale assessments, the process can take a month or more. Classroom teachers can evaluate writing in less time, of course, but it still can take weeks, as any English teacher with five or six sections of classes can attest.
"Writing is very time and labor and cost intensive to score at any type of scale," Wilson said.
Those who have participated in the traditional method of scoring standardized tests know that it takes a toll on the human assessor, too.
Where it might take a human reader five minutes to attach a holistic score to a piece of writing, the automated system can process thousands at a time, producing a score within a matter of seconds, Wilson said.
"If it takes a couple weeks to get back to the student they don't care about it anymore," he said. "Or there is no time to do anything about it. The software vastly accelerates the feedback loop."
But computers are illiterate. They have zero comprehension. The scores they attach to writing are based on mathematical equations that assign or deduct value according to the programmer's instructions.
They do not grade on a curve. They do not understand how far Johnny has come in his writing and they have no special patience for someone who is just learning English.
These computer deficiencies are among the reasons many teachers – including the National Council of Teachers of English – roundly reject computerized scoring programs. They fear a steep decline in instruction, discouraging messages the soulless judge will send to students, and some see a real threat to those who teach English.
In a recent study, Wilson and other collaborators showed that use of automated feedback produced some efficiencies for teachers, faster feedback for students, and moderate increases in student persistence.
This time they brought a different question to their review. Could automated scoring and feedback produce benefits throughout the school year, shaping instruction and providing incentives and feedback for struggling writers, beyond simply delivering speedy scores?
"If we use the system throughout the year, can we start to improve the learning?" Wilson said. "Can we change the trajectory of kids who would otherwise fail, drop out or give up?"
To find out, he distributed free software subscriptions provided by Measurement Incorporated to teachers of third-, fourth- and fifth-graders at Mote and Heritage and asked them to try it during the 2014-15 school year.
Teachers don't dismiss the idea of automation, he said. Calculators and other electronic devices are routinely used by educators.
"Do math teachers rue the day students didn't do all computations on their own?" he said.
Wilson heard mixed reviews about use of the software in the classroom when he met with teachers at Mote in early June.
Teachers said students liked the "game" aspects of the automated writing environment and that seemed to increase their motivation to write quite a bit. Because they got immediate scores on their writing, many worked to raise their scores by correcting errors and revising their work over and over.
"There was an 'aha!' moment," one teacher said. "Students said, 'I added details and my score went up.' They figured that out."
And they wanted to keep going, shooting for higher scores.
"Many times during recess my students chose to do PEGWriting," one teacher said. "It was fun to see that."
That same quick score produced discouragement for other students, though, teachers said, when they received low scores and could not figure out how to raise them no matter how hard they worked. That demonstrates the importance of the teacher's role, Wilson said. The teacher helps the student interpret and apply the feedback.
Teachers said some students were discouraged when the software wouldn't accept their writing because of errors. Others figured out they could cut and paste material to get higher scores, without understanding that plagiarism is never acceptable. The teacher's role is essential to that instruction, too, Wilson said.
Teachers agreed that the software showed students the writing and editing process in ways they hadn't grasped before, but some weren't convinced that the computer-based evaluation would save them much time. They still needed to have individual conversations with each student – some more than others.
"I don't think it's the answer," one teacher said, "but it is a tool we can use to help them."
How teachers can use such tools effectively to demonstrate and reinforce the principles and rules of writing is the focus of Wilson's research. He wants to know what kind of training teachers and students need to make the most of the software and what kind of efficiencies it offers teachers to help them do more of what they do best: teach.
Bradford Holstein, principal at Mote and a UD graduate who received a bachelor's degree in 1979 and a master's degree in 1984, welcomed the study and hopes it leads to stronger writing skills in students.
"The automated assessment really assists the teachers in providing valuable feedback for students in improving their writing," Holstein said.