Mark Shermis and Ben Hamner are presenting a paper on automated essay scoring at NCME on Monday, and there has already been quite a bit of reporting on it, such as yesterday’s article at IHE. I link to the IHE reporting because the comment thread is interesting and includes a contribution by Les Perelman: a nonsensical bit of writing that earned a high score from an (unnamed) automated essay scoring algorithm.
Shermis and Hamner worked with thousands of essays from several states, with subsamples drawn to allow development of models by each of several algorithms and then the testing of those models against a second set of subsamples. For the tasks chosen–all of which were on-demand, relatively short pieces of writing on state tests–the automated essay scoring algorithms generally performed about as well on technical measures as individual human readers. Let me cut to the chase: this demonstrates the viability of automated essay scoring for commonly used, mediocre writing tests. This is a more cynical view than what the authors write on p. 27, but you may note that they are not claiming this is the best thing since sliced bread:
As a general scoring approach, automated essay scoring appears to have developed to the point where it can be reliably applied in both low-stakes assessment (e.g., instructional evaluation of essays) and perhaps as a second scorer for high-stakes testing.
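As an aside on what “technical measures” of agreement between a machine and a human reader can look like: one common statistic for ordinal scores is quadratic weighted kappa, which penalizes a disagreement more the further apart the two scores are. The choice of statistic here is my assumption for illustration, not a summary of the paper’s methods, and the function name, the score lists, and the 1-to-4 scale are all made up. A minimal sketch in plain Python:

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two lists of integer scores.

    1.0 means perfect agreement, 0.0 means no better than chance,
    and negative values mean worse than chance.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    if max_score <= min_score:
        raise ValueError("the scale needs at least two score points")

    n = len(rater_a)
    span = max_score - min_score          # largest possible disagreement
    observed = Counter(zip(rater_a, rater_b))
    hist_a = Counter(rater_a)
    hist_b = Counter(rater_b)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Penalty grows with the square of the score difference.
            weight = (i - j) ** 2 / span ** 2
            # Count expected in this cell if the two raters were independent.
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[(i, j)]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters gave every essay the same single score.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a 1-4 holistic scale for six essays.
human_scores = [1, 2, 3, 4, 2, 3]
machine_scores = [1, 2, 3, 3, 2, 4]
print(quadratic_weighted_kappa(human_scores, machine_scores, 1, 4))
```

In a comparison of this kind, the number that matters is not the machine-human kappa on its own but how it compares with the kappa between two human readers scoring the same essays.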
Shermis and Hamner’s work appears on my first reading to be well-designed, and I think the conclusions are trustworthy as long as one understands what the computer algorithms were asked to mimic: holistic scoring of on-demand writing. In the context of state-level assessment, the scoring is often conducted by human graders brought into a large facility to give a single score to each of hundreds of short essays written by eighth graders in response to a prompt asking for a narrative, expository, or persuasive bit of text. Personal grading this isn’t, and as long as the goal is scoring thousands of short essays without important feedback to students, then I guess it is true that the study suggests that automated essay scoring is “fast, accurate, and cost effective,” as Tom Vander Ark was quoted in a press release emailed to me a few days ago.
I have been racking my brains trying to figure out where such automated essay scoring might be useful in any context other than large-jurisdiction summative assessments, and it is hard to picture one. I suppose that with some algorithms you could train the scoring program to focus on issues an individual teacher might care about (for me, issues such as coherence of passages and logical sequencing of ideas), but these programs would probably require far more sample essays and scores than most teachers work with. I strongly suspect that even the programs using natural language processing cannot give feedback to students at anything other than the global scale.
Fortunately, there is one use I read about in that IHE comment thread that might make sense: David Horacek’s idea to set a particular automated scoring result as a threshold for reading work.
[H]ere’s something I could imagine saying to my students: “I will only accept/read term papers that the AES [automated essay score] grades as B or better, so start early so that you don’t miss the due date.” (They would have the AES on their own computers, so they could “grade” their drafts themselves as often as they wanted.) Of course, their actual grade would be determined by me. The automatic grader would just filter out lazy and sloppy work, leaving me to focus my evaluation on the quality of ideas.
So the purpose here is to get students to focus on writing. My guess is that in the ideal, this would push students to write papers a little earlier, so that they have time to fix problems and earn a basic-writing score high enough to be graded.
Ah, but if that is what we want–students to take some care in writing–we could set another requirement that forces students to proofread. Say, we could require that students demonstrate that they have used their word processors’ spell-check features and that they use “Track Changes” (or the equivalent) to show any changes in the draft before submission.
Any other ideas for how to use these automated essay scoring algorithms, other than large-scale summative assessment?
Addendum: For a sample skeptical take, see Audrey Watters. For a more hopeful take, see Justin Reich (see my concerns above about minimum n for stable estimates). For total cynicism, try Michael Winerip, mellifluously.