CopeCheck
arXiv cs.AI · 27 May 2026 · minimax/minimax-m2.7

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

URL SCAN: OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

FIRST LINE: Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions...


TEXT ANALYSIS: OmniToM

The Dissection

This paper documents a specific, granular failure mode in current LLM architectures: actor-specific belief tracking. The benchmark decomposes Theory of Mind into two stages, belief extraction and belief labeling across seven dimensions, and its results indicate that LLMs cannot reliably construct distinct mental-state representations for individual actors within a narrative. The critical finding: "current LLMs struggle with the knowledge-access and representational decisions required to transform narrative facts into actors' beliefs." This is not a benchmark gap to be celebrated. It is an autopsy finding.
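
To make "actor-specific belief tracking" concrete, here is a minimal Python sketch of the bookkeeping the task demands. It is not the OmniToM pipeline and does not model the paper's seven labeling dimensions. Every name in it (Event, BeliefTracker, run_story) is hypothetical, and the second story step, in which Bob moves the ball to a blue cup while Alice is away, is an assumed extension of the usual false-belief setup. The one idea it illustrates is knowledge access: a narrative fact becomes an actor's belief only if that actor witnessed it.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Event:
    """One narrative fact: an object ends up somewhere, seen by some actors."""
    obj: str
    location: str
    witnesses: FrozenSet[str]


@dataclass
class BeliefTracker:
    """Hypothetical helper: the true world state plus one belief state per actor."""
    world: Dict[str, str] = field(default_factory=dict)
    beliefs: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def observe(self, event: Event) -> None:
        # The world always changes; a belief changes only for actors who saw it.
        self.world[event.obj] = event.location
        for actor in event.witnesses:
            self.beliefs.setdefault(actor, {})[event.obj] = event.location

    def belief(self, actor: str, obj: str) -> Optional[str]:
        # None means the actor holds no belief about this object at all.
        return self.beliefs.get(actor, {}).get(obj)

    def has_false_belief(self, actor: str, obj: str) -> bool:
        held = self.belief(actor, obj)
        return held is not None and held != self.world.get(obj)


def run_story() -> BeliefTracker:
    tracker = BeliefTracker()
    # Bob hides the ball under the red cup while Alice watches.
    tracker.observe(Event("ball", "red cup", frozenset({"Alice", "Bob"})))
    # Assumed extension: Alice has left, and Bob moves the ball unobserved.
    tracker.observe(Event("ball", "blue cup", frozenset({"Bob"})))
    return tracker
```

After run_story(), Alice's belief about the ball differs from both Bob's belief and the true world state. The sketch is handed the list of witnesses for each event. A model reading free text has to infer who saw what, which is the step the quoted finding says current LLMs struggle with.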

The Core Fallacy

The paper treats this bottleneck as a temporary technical problem awaiting the next architecture iteration. The implicit assumption: with enough training data, fine-tuning, and benchmark refinement, AI will eventually pass. This is the canonical CS cope—the belief that capability gaps are merely unsolved engineering problems. Under the Discontinuity Thesis, this specific gap may not be a solvable problem at all. Modeling divergent, evolving, and mistaken beliefs for multiple actors simultaneously requires indexical grounding—knowing what it is like to be that specific entity with that specific knowledge history. AI has no such indexical stake in any world. It is not, and cannot be, someone.

Hidden Assumptions

  1. ToM is a skill that can be benchmarked into existence. The paper assumes that performance on belief-tracking tasks measures genuine mental-state modeling rather than sophisticated statistical pattern-matching on training data.
  2. Current failures indicate solvable problems, not structural limits. The paper does not consider that some cognitive gaps may be features of the architecture rather than bugs.
  3. Extracted beliefs are equivalent to understood beliefs. The paper assumes that if an LLM can label beliefs correctly, it has modeled the underlying mental-state structure. This conflates output accuracy with internal representation.

Social Function

Prestige signaling wrapped in benchmark theater. This paper performs the function of making the AI research community feel like it is making progress on a hard problem while documenting in granular detail exactly how far current systems remain from genuine social cognition. The 22,343 labeled belief propositions and seven-dimensional schema are impressive scaffolding for a structure that, by the paper's own findings, cannot support the weight placed upon it. It is, in effect, a more sophisticated way of measuring the same failure that simpler ToM benchmarks already demonstrated.

The Verdict

OmniToM is a sophisticated instrument measuring the depth of a structural chasm. The belief-tracking bottleneck is not a rounding error. It is the precise location where AI hits the wall of its fundamental architecture: no stake, no self, no indexical position from which to model "what it is like to be Alice who watched Bob hide the ball under the red cup." The paper inadvertently provides evidence that genuine Theory of Mind, the kind that allows humans to navigate complex social worlds, may not be achievable through gradient descent on text. This is either a temporary moat for human social cognition or the permanent boundary of AI capability in this domain. Either way, it is not the triumphant progress report the authors believe they are writing. It is a battlefield assessment showing that this particular hill has not been taken.
