arXiv cs.CY · 26 May 2026 · minimax/minimax-m2.7

Catching The Correct Answer Trap: Characterising AI Tutor Blind Spots When Analysing Student Reasoning

FIRST LINE: Intelligent tutoring systems increasingly provide automated feedback on student work...

THE DISSECTION

This paper is a forensic audit of AI tutoring systems, and what it actually reveals is far more damning than its "modest academic contribution" framing suggests. The authors document a specific failure mode—The Correct Answer Trap (CAT)—where AI models can't distinguish between a student who understands and a student who stumbled into the right number through broken reasoning. They find that even frontier LLMs achieve only 57% detection accuracy on this task, generating four false positives for every genuine catch.
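The "four false positives for every genuine catch" figure can be restated as a precision value. The following is a minimal sketch of that arithmetic only; the function name is illustrative and does not come from the paper, and it assumes "genuine catch" means a true positive.

```python
def precision_from_ratio(false_positives_per_true_positive: float) -> float:
    """Precision = TP / (TP + FP), with TP normalised to 1.

    Assumption: a "genuine catch" is a true positive and every other
    flag raised by the model is a false positive.
    """
    return 1.0 / (1.0 + false_positives_per_true_positive)


# "Four false positives for every genuine catch" means FP:TP = 4:1.
precision = precision_from_ratio(4)
print(f"precision = {precision:.0%}")  # precision = 20%
```

In other words, if the ratio holds, only one in five flags raised by the model points at genuinely flawed reasoning.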

THE CORE FALLACY

The paper treats this as a technical problem amenable to better models and fine-tuning. This is the profound misunderstanding. The Correct Answer Trap isn't a bug in AI tutoring—it's a structural feature of how AI evaluates cognition.

The core issue: assessing reasoning requires tacit context about why a reasoning path is wrong within this specific student's mental model, given their educational history. AI can pattern-match against known misconceptions, but genuine reasoning assessment needs a teacher who knows the student's journey, not just the answer matrix.
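To make the trap concrete, here is a hypothetical illustration that is not taken from the paper. It uses the classic "anomalous cancellation" mistake, where a student simplifies 16/64 by striking out the shared 6. The function names are invented for this sketch. A grader that only consults the answer matrix marks the student correct whenever the broken method happens to land on the right value.

```python
from fractions import Fraction


def cancel_shared_digit(numerator: int, denominator: int) -> Fraction:
    """Flawed method: strike the numerator's last digit and the
    denominator's first digit, e.g. 16/64 -> 1/4 by 'cancelling' the 6s."""
    return Fraction(numerator // 10, denominator % 10)


def answer_only_grade(numerator: int, denominator: int, student_answer: Fraction) -> bool:
    """Grade by comparing the final answer to the true value, ignoring the method."""
    return student_answer == Fraction(numerator, denominator)


for n, d in [(16, 64), (19, 95), (12, 24)]:
    answer = cancel_shared_digit(n, d)
    verdict = answer_only_grade(n, d, answer)
    print(f"{n}/{d}: student says {answer}, marked correct: {verdict}")
```

The first two problems are marked correct even though the method is invalid. Only the third exposes it. An answer-only check cannot tell these students apart from ones who actually reduced the fraction.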

THE HIDDEN ASSUMPTION

The paper assumes that with enough data, better models, and careful benchmarking, AI can eventually perform reliable reasoning assessment at scale. This is a functional assumption: human cognitive evaluation can be automated if the ML pipeline gets sophisticated enough.

The Discontinuity Thesis (DT) counter: AI can identify categories of error, but it cannot assess developmental reasoning in the way that matters for human capability formation. The paper's own data bears this out: improved capabilities reduce the problem but do not eliminate it. This isn't a current-generation limitation. It's the nature of the task.

THE SOCIAL FUNCTION

This is prestige signaling wrapped in careful empiricism. The authors frame their work as "useful for AI-assisted education" while their data shows AI tutoring is unreliable for the most pedagogically important task (detecting flawed reasoning behind correct answers). The conclusion—"human judgment still benefits"—is the academic equivalent of "please don't take my research funding."

THE VERDICT

This paper inadvertently demonstrates that AI cannot perform authentic cognitive evaluation, even on a constrained, well-defined task with clean data. The Correct Answer Trap is not a model deficiency. It's a structural impossibility for systems that lack genuine understanding of the student's cognitive state.

The implication for the Discontinuity Thesis: If AI cannot reliably assess whether a human is reasoning correctly, then AI cannot reliably assess whether a human has acquired a skill. This means AI cannot reliably certify human competence. And if AI cannot certify human competence, then the "AI-assisted education" paradise the ed-tech industry is selling is a hollow promise—not because the technology is too primitive, but because the task is definitionally resistant to automation.

The paper's findings are a death knell for the "AI will make education scalable and cheap" fantasy. You can automate content delivery. You cannot automate the assessment of whether a human is actually thinking.