The rapid advance of generative AI has changed how people write. AI is now built into many everyday writing tools, helping users generate ideas, draft content, revise sentences, and improve their writing. As a result, writing is increasingly becoming a collaborative process between humans and AI. For students, educators, and testing organizations, this raises a fundamental question: When AI becomes part of the writing process, what essential writing skills should we value, and how should we measure them?
This shift also challenges existing automated scoring systems, which were developed largely on the assumption that essays were written independently by humans. Features such as grammar, usage, mechanics, and organization have long been used as indicators of writing quality and are a key part of many automated scoring models. But when AI can improve these aspects of writing with minimal effort, their role in automated scoring needs to be reconsidered. This challenge is most relevant to unproctored writing assignments, where AI use is difficult to control, rather than to formal proctored writing tests where access to such tools can be restricted.
A recent paper, “AI-Generated Essays: Characteristics and Implications for Automated Scoring and Academic Integrity,” published in Educational Measurement: Issues and Practice (EM:IP), explores this issue through the lens of the GRE Analytical Writing Assessment. The study, which evolved from an ETS summer internship project, compared AI-generated essays with human-written essays and evaluated them using both trained human raters and ETS’s automated scoring engine, e-rater®. The findings reveal important differences between AI-generated and human-written essays and offer useful insights for the next generation of automated scoring systems.
Automated scoring faces a new challenge
Automated scoring plays an important role in large-scale writing assessment. These systems often rely on language features such as grammar, usage, mechanics, style, organization, and word choice because they can be efficiently computed with NLP techniques. While these features are part of the construct in many language tests, in tasks focused more on argumentation and reasoning, they often serve as indirect indicators of deeper writing quality rather than direct evidence of the quality of ideas, evidence, or reasoning.
For example, a student who writes with accurate grammar, clear organization, and well-developed paragraphs often also demonstrates stronger reasoning and communication skills.
Generative AI changes that relationship. AI-generated essays can score highly on language-related features because the technology can produce polished, well-structured writing. However, strong language features in AI-generated essays are not always accompanied by strong reasoning, meaningful analysis, or original thought.
As a result, some of the features that have traditionally been good indicators of writing quality become less reliable when essays are generated or heavily assisted by AI.
What the study found
The study revealed two important findings.
First, AI-generated essays consistently outperformed human-written essays on language-related features, even when the underlying ideas or arguments were relatively limited. Second, e-rater® assigned higher scores to AI-generated essays than human raters did.
This difference reflects how automated scoring systems have traditionally been developed. e-rater® was trained using human-written essays, where strong language use is typically associated with stronger overall writing. As a result, these features play an important role in the scoring process.
AI-generated essays can perform extremely well on these language-related features while still lacking strong analytical reasoning, use of evidence, and depth of argument. When e-rater® assigns the same weights to these features in evaluating AI-generated essays, the resulting scores are inflated.
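To make the inflation mechanism concrete, here is a minimal sketch of a weighted-feature score. It is not e-rater®'s model: the feature names, the weights, the feature values, the 0 to 6 scale mapping, and the `weighted_score` function are all assumptions chosen for illustration. The point is only that when most of the weight sits on language features, a polished essay with weak reasoning can outscore a less polished essay with stronger reasoning.

```python
# Toy illustration only. Feature names, weights, and values are hypothetical
# and do not describe e-rater's actual features or weighting.
HYPOTHETICAL_WEIGHTS = {
    "grammar": 0.25,
    "mechanics": 0.20,
    "organization": 0.25,
    "word_choice": 0.15,
    "reasoning": 0.15,
}


def weighted_score(features, weights=HYPOTHETICAL_WEIGHTS, scale=6.0):
    """Weighted average of feature values in [0, 1], mapped onto 0..scale."""
    total_weight = sum(weights.values())
    raw = sum(weights[name] * features[name] for name in weights)
    return scale * raw / total_weight


# A human essay: decent language, comparatively strong reasoning (made-up values).
human_essay = {
    "grammar": 0.7, "mechanics": 0.7, "organization": 0.7,
    "word_choice": 0.7, "reasoning": 0.8,
}

# An AI-generated essay: near-perfect language, weak reasoning (made-up values).
ai_essay = {
    "grammar": 1.0, "mechanics": 1.0, "organization": 0.95,
    "word_choice": 0.9, "reasoning": 0.4,
}

if __name__ == "__main__":
    print("human:", round(weighted_score(human_essay), 2))
    print("ai:   ", round(weighted_score(ai_essay), 2))
```

With these made-up numbers the AI-generated essay receives the higher score even though its reasoning value is half that of the human essay, because 85 percent of the total weight is assigned to language features.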
Human raters, by contrast, assess not only language quality but also the quality of reasoning, use of evidence, and development of ideas, as guided by the scoring rubric. This explains why human raters did not score AI-generated essays as highly as the automated system.
Importantly, these findings do not suggest that e-rater® is flawed. Rather, they highlight how generative AI has changed some of the assumptions on which existing automated scoring systems were built.
What automated scoring needs next
Automated scoring systems do more than assign scores. Before scoring begins, they typically check whether a response is appropriate for scoring at all. Traditionally, this step has focused on flagging essays that are off-topic, unusually short or long, repetitive, memorized, or otherwise not appropriate for scoring.
As AI-assisted writing becomes more common, this initial screening process needs to expand to identify AI-generated or heavily AI-assisted responses when the use of AI is not allowed. In fact, findings from the EM:IP paper show that essays generated by a range of generative AI models can be detected with high accuracy. However, detection methods will need to be continuously updated as new AI models emerge.
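The screening step described above can be sketched as a small function that runs before any score is assigned. This is a hedged illustration, not how any production engine works: the function name `screen_response`, the flag labels, every threshold, and the `ai_detector` callable (a stand-in for whatever detection model is in use, assumed to return a probability between 0 and 1) are hypothetical.

```python
from collections import Counter


def screen_response(text, ai_detector=None, ai_allowed=False,
                    min_words=50, max_words=1000,
                    max_repeat_ratio=0.5, ai_threshold=0.9):
    """Return advisory flags; an empty list means the response can be scored.

    All thresholds are illustrative assumptions, not operational values.
    """
    flags = []

    # Length check: unusually short or long responses.
    words = text.split()
    if len(words) < min_words:
        flags.append("too_short")
    elif len(words) > max_words:
        flags.append("too_long")

    # Repetition check: share of sentences that are copies of the most
    # frequent sentence (a crude proxy for repetitive responses).
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s.strip().lower() for s in normalized.split(".") if s.strip()]
    if len(sentences) > 1:
        top_count = Counter(sentences).most_common(1)[0][1]
        if top_count / len(sentences) > max_repeat_ratio:
            flags.append("repetitive")

    # AI check: applied only when AI use is not allowed and a detector exists.
    if ai_detector is not None and not ai_allowed:
        if ai_detector(text) >= ai_threshold:
            flags.append("possible_ai_generated")

    return flags
```

Passing the detector in as a parameter reflects the point that detection methods need continual updating: the detector can be swapped as new AI models emerge without changing the rest of the screening logic.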
At the same time, automated scoring systems need to reconsider how much emphasis they place on different aspects of writing. Surface-level language features become less useful as indicators of deeper reasoning when AI can improve them with minimal effort.
Future systems should place greater emphasis on deeper qualities of writing, such as the effective use of evidence, quality of reasoning, depth of analysis, and strength of argument.
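One way to picture this shift in emphasis is to move weight from surface features to deeper ones and see how the ranking of two essays changes. The sketch below is purely illustrative: the feature groups, the `make_weights` and `score` helpers, the `deep_share` parameter, and all feature values are assumptions, and reliably measuring qualities such as reasoning or use of evidence is a much harder problem than this toy suggests.

```python
# Hypothetical feature groups; the values below are invented for illustration.
SURFACE = ("grammar", "mechanics", "organization")
DEEP = ("evidence", "reasoning", "argument")


def make_weights(deep_share):
    """Split a total weight of 1.0 between deep and surface features.

    Weight is divided evenly within each group.
    """
    if not 0.0 <= deep_share <= 1.0:
        raise ValueError("deep_share must be between 0 and 1")
    weights = {name: (1.0 - deep_share) / len(SURFACE) for name in SURFACE}
    weights.update({name: deep_share / len(DEEP) for name in DEEP})
    return weights


def score(features, weights):
    """Weighted sum of feature values, each assumed to lie in [0, 1]."""
    return sum(weights[name] * features[name] for name in weights)


polished_but_shallow = {
    "grammar": 1.0, "mechanics": 1.0, "organization": 0.9,
    "evidence": 0.3, "reasoning": 0.4, "argument": 0.3,
}
rough_but_insightful = {
    "grammar": 0.6, "mechanics": 0.6, "organization": 0.7,
    "evidence": 0.8, "reasoning": 0.9, "argument": 0.8,
}

if __name__ == "__main__":
    for share in (0.2, 0.7):
        w = make_weights(share)
        print(share,
              round(score(polished_but_shallow, w), 3),
              round(score(rough_but_insightful, w), 3))
```

With a small `deep_share` the polished essay ranks first; with a larger one the order reverses, which is the behavior the recommendation above is aiming for.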
The future of writing assessment
AI-assisted writing is here to stay. As these tools become part of everyday writing, the central question is no longer how to detect or prevent their use, but how to redefine what writing assessments should measure in this new environment.
Answering that question will require agreement on several important issues, including what level of independent writing ability is expected, what kinds of AI assistance are appropriate, and what evidence should be used to evaluate writing quality. Automated scoring systems must evolve alongside this broader conversation, so that they continue to support valid and meaningful judgments about writing in the age of AI.