AI Insight
Researchers tested leading generative AI models by having them grade hundreds of undergraduate essays, then compared the results with grades assigned by human evaluators. The AI systems matched the human-awarded degree classifications only about half the time, indicating a significant gap in grading reliability. The models had particular difficulty at both ends of the performance spectrum, struggling to accurately evaluate the highest-quality and lowest-quality submissions, and appeared to favor stylistic qualities over substantive academic content.
Why it matters
As universities face pressure to adopt AI tools for administrative and academic tasks, these findings caution against relying on current AI systems as grading instruments. Widespread use of inadequately calibrated AI graders could systematically disadvantage students whose work is substantively strong but stylistically plain, or, conversely, reward submissions that are superficially polished but intellectually weak.
Researchers used top generative AI models to grade hundreds of undergraduate essays and found that the AI matched human-awarded degree classifications only around half the time, often failing to accurately assess the best and worst submissions.
Source: AI not yet good enough to grade university essays, rewarding 'style over substance'