Auto-Essay Scoring Engines Confuse Prediction with Measurement

The Automated Student Assessment Prize (ASAP) essay grading competition was created to assess how well different vendors of automated essay scoring (AES) engines, along with public competitors from around the world, could score student-generated essays relative to their human counterparts. The competition was sponsored by the William and Flora Hewlett Foundation and hosted on the data prediction platform kaggle.com. As it turns out, the competition confirmed what has been known for years: AES engines not only agree well with their human counterparts, they agree with humans better than humans agree with one another! The results of the study have been written up in a recent report and touted by some as the final green light for letting AES engines score student essays, particularly in high-stakes settings such as standardized tests, since the engines are not only more consistent but also much faster and orders of magnitude less expensive.

Competitors were given training data that included the text of thousands of essays written to eight different prompts, along with the scores assigned to each essay by two human raters. Based on this training data, competitors had to use their AES engines to predict how human raters would score new essays without knowing the human-assigned scores. This is the underlying problem with the competition: AES engines use a number of features teased out of the text to predict how two human raters would score an essay; they are not actually measuring (or at least estimating) the quality of the essay based on the harder-to-define constructs of writing the way the humans who originally scored them did. Of course, humans often lack consistency in their scoring, but their scores are still based on things that computers do not yet understand.
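
To make that framing concrete, here is a minimal sketch of the prediction task, using scikit-learn and a few invented toy essays and scores in place of the real ASAP data; it illustrates the setup, not any competitor's actual engine.

```python
# A minimal sketch of the prediction task, using invented toy essays and
# scores in place of the real ASAP data. Any surface features will do; the
# model is rewarded only for agreeing with the human raters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

train_essays = ["The author builds a vivid setting and supports it with detail.",
                "Setting is important because it tells you where things happen.",
                "I think the story was good."]
train_scores = [3, 2, 1]  # resolved human scores (toy values)

vec = TfidfVectorizer()
X = vec.fit_transform(train_essays)    # surface features of the text; no prompt, no meaning
model = Ridge().fit(X, train_scores)   # trained to reproduce the human scores

new_essays = ["The setting shapes the mood of the whole story."]
print(model.predict(vec.transform(new_essays)))
```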

Additionally, human raters score the quality of an essay based on its causal attributes, whereas an AES algorithm will also likely exploit noncausal attributes. As it turns out, the number of characters in an essay is highly correlated with its quality. Better writers are able to write more in a fixed period of time, and writing more gives you more space to communicate your ideas, which is the goal of writing. But while the number of characters generated is a good predictor of essay quality, it is not a causal attribute of essay quality: a sentence repeated 100 times does not represent better writing than the same sentence written a single time.
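
The check itself is trivial; the short sketch below uses invented essays and scores purely to show the mechanics of measuring that correlation.

```python
# Checking the length/score relationship on a toy data set; the essays and
# scores below are invented, purely to show the mechanics of the check.
import numpy as np

essays = ["Short answer.",
          "A few sentences that develop one idea with a single example.",
          "A longer response that states a claim, supports it with evidence, "
          "considers a counterargument, and closes with a conclusion."]
human_scores = np.array([1, 2, 3])               # hypothetical resolved scores

char_counts = np.array([len(e) for e in essays])
r = np.corrcoef(char_counts, human_scores)[0, 1]
print(f"correlation between character count and score: {r:.2f}")
# A strong correlation makes length a useful predictor, but padding an essay
# adds characters without adding quality.
```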

To further highlight these issues, after the vendor-only competition wrapped up I decided to attempt to create an AES engine from scratch that would do well on the scoring metric the competition used but would actually be a rather inadequate essay grader, and to enter it in the public competition so that my results could be verified. Within a week, I had created such an AES engine. Using just the raw essay text (but not the prompt) for an essay set, the resolved (combined) human score, and no pre-existing corpora (giving my algorithm the neat quality of being language-independent), I was able to get a score that placed my performance in the top tier of the private competitors. Considering that the vendor algorithms were developed by many experts over the course of years or even decades, this would be an impressive feat if the issues that I point out didn’t actually exist. Ultimately, I was among the top handful of public competitors invited to try to sell their algorithms to other vendors.

To give you a sense of the simplicity of my approach, my AES algorithm was restricted to a few very simple variables, the most important of which (not surprisingly) was the total number of characters in the raw text of an essay. Beyond that, I used a rudimentary form of Latent Semantic Analysis (LSA), a technique for estimating the similarity in meaning between different pieces of text, where the semantic space (think of words floating in a high-dimensional space, with proximity in that space reflecting closeness in meaning) was built from just the raw text of a single essay set. For those not familiar with LSA, building the space this way produces something very small and substandard compared with the much more expansive data sets that are typically used. These variables were combined in a way that optimizes the scoring metric used to judge the algorithms, the quadratic weighted kappa, and the method I devised to do this may have been the most novel thing about my algorithm.
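
For the curious, here is a rough sketch of that recipe with scikit-learn and toy data. It is not my actual code: in particular, it substitutes a plain regression for the QWK-optimizing combination method described above, but the ingredients (character count plus a tiny LSA space built from the essay set alone) are the same.

```python
# A rough sketch of the recipe with toy data, not the actual competition code:
# character count plus an LSA space built only from the essay set's own text,
# combined here with a plain regression (the QWK-optimizing combination method
# is not reproduced), then evaluated with the quadratic weighted kappa.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import cohen_kappa_score

essays = ["dogs are loyal pets",
          "cats and dogs are common pets in many homes",
          "a good pet needs food water and regular care from its owner",
          "pets such as cats dogs and birds give people company every day",
          "caring for a pet teaches children responsibility and kindness",
          "owning a pet means daily feeding walking and attention over many years"]
scores = [1, 1, 2, 2, 3, 3]              # hypothetical resolved human scores

# LSA: term-document matrix from the essay set alone, projected to a low-rank
# space (only 2 dimensions here because the toy corpus is tiny).
tdm = CountVectorizer().fit_transform(essays)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tdm)

# Combine the LSA dimensions with the single most important variable:
# the total number of characters in the raw text.
chars = np.array([len(e) for e in essays]).reshape(-1, 1)
X = np.hstack([chars, lsa])

model = LinearRegression().fit(X, scores)
pred = np.clip(np.rint(model.predict(X)), min(scores), max(scores)).astype(int)

# Quadratic weighted kappa: the agreement metric used to judge the algorithms.
print(cohen_kappa_score(scores, pred, weights="quadratic"))
```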

The problem with my AES engine: you could just write a bunch of random characters, and eventually you would write enough to get a perfect score. Or you could write down a bunch of words related to the prompt that you thought other students would use and place them in whatever nonsensical order you wanted. You needn’t worry about capitalization, punctuation, or anything like that, and you certainly don’t need to worry about whether what you write is factually correct. Unfortunately, other well-established AES engines fall prey to being easily “gameable” too, despite using much more refined natural language processing (NLP) techniques (see Facing a Robo-Grader? Just Keep Obfuscating Mellifluously). When the essays used in the ASAP competition were originally written, the students understood that their essays would be graded by a human and thus made no attempt to game the system. But had a student known they would be graded by a particular AES engine, and had they learned over time, through feedback from the system, which writing behaviors earned high scores, wouldn’t they eventually have exhibited those behaviors? This is Goodhart’s Law: as soon as a method is designed to provide a proxy measure for some construct, behavior adapts to target the proxy; the proxy does not necessarily track the construct under that pressure, and it no longer carries the information that qualified it as a proxy in the first place.
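
To see Goodhart’s Law in miniature, consider a hypothetical scorer whose proxy is nothing but length; the essays below are invented, but the failure mode is exactly the one described above.

```python
# A toy illustration of the failure mode: a purely length-driven scorer
# (hypothetical, standing in for any engine whose proxy is easy to target).
def proxy_score(essay: str) -> int:
    """Map character count onto a 1-6 score, as a stand-in length-based engine."""
    return min(6, 1 + len(essay) // 300)

honest = "Setting matters because it shapes how the reader feels about the story."
gamed = "mellifluous obfuscation " * 100   # prompt-adjacent words in a nonsensical order

print(proxy_score(honest), proxy_score(gamed))  # the nonsense outscores the real sentence
```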

To be clear, it is not that I think a computer can’t ultimately account for the same things that human raters do, or that it can’t eventually do a better job of scoring essays based on those things. It is just that the ASAP competition did not adequately address the issue of prediction vs. measurement, and it placed no constraints to ensure that competitors used only causal attributes or built engines immune to students “gaming” or slowly tuning their writing styles to the tastes of the algorithms, which would be disastrous. To the credit of Dr. Mark Shermis, Dean of Education at the University of Akron and main author of the study, he has acknowledged these specific issues in the report, avoided claiming that any AES engine is better than another, and has himself stated that, despite the results of the study, AES engines are not ready to be used beyond low-stakes classroom situations where they are supplemental to, and do not replace, the teacher.
