MathAtlas: A Benchmark for Autoformalization in the Wild
THE DISSECTION
TEXT START: "Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored."
This paper introduces MathAtlas: a benchmark of ~52,000 mathematical entities (theorems, definitions, proofs, exercises) from 103 graduate-level textbooks, with a dependency graph of ~178,000 relations between them. The core finding is that state-of-the-art models achieve at most 9.8% correctness on theorem formalization and 2.6% on the hardest subset. The implicit thesis: this is a hard problem, progress is slow, but the benchmark will help. That framing is the first lie.
THE CORE FALLACY
The paper frames this as an unsolved challenging research problem. This is copium. What the results actually demonstrate is that AI is barely touching graduate mathematics formalization, and the reason is structural, not accidental.
When you look at what autoformalization actually requires:
- Semantic precision: A theorem statement must be translated into a formal language (Lean, Isabelle, Coq) such that it compiles and is provably equivalent to the informal original; see the Lean sketch after this list.
- Dependency resolution: Formalization chains through prior definitions and theorems. The paper's own MA-Hard subset—entities with the deepest dependency trees—achieves 2.6% correctness. This is the critical signal: as dependencies accumulate, performance collapses. The "dependency graph" of 178k relations is a representation of the accumulated mathematical knowledge that must be navigated and respected.
- Research-level abstraction: Graduate mathematics uses concepts that require deep context, implicit conventions, and often informal shorthand that compresses entire proof structures into a few words.
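To make the precision requirement concrete, here is a minimal Lean 4 sketch of the kind of translation the task demands, assuming Mathlib for the `obtain` and `ring` tactics. The theorem is my own toy example, not a MathAtlas entity:

```lean
-- Informal statement: "The sum of two even numbers is even."
-- Toy formalization (my example, not drawn from MathAtlas).
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha  -- unpack the witness for a
  obtain ⟨n, hn⟩ := hb  -- unpack the witness for b
  exact ⟨m + n, by rw [hm, hn]; ring⟩  -- 2*m + 2*n = 2*(m + n)
```

Compiling is the easy half. The existential encoding of "even", the choice of Nat over Int, and the binder structure all have to match the informal meaning, and at graduate level every such choice must also agree with dozens of upstream formalization choices.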
The fallback narrative—"this benchmark will drive progress"—is the standard "benchmark as progress fuel" trope. But the DT lens reveals what's actually happening.
HIDDEN ASSUMPTIONS
- Progress assumption: The paper assumes the benchmark will enable measurable progress. But the degradation with dependency depth is not a gap; it's a wall. The deeper the mathematical context required, the worse performance gets. This is not a data problem. It's a fundamental limitation of pattern-matching systems confronting mathematical reasoning that is constructed, not statistically distributed (see the toy decay model after this list).
- Scope assumption: The paper treats graduate mathematics as "underexplored" in autoformalization. What it's really describing is the last frontier of cognitive complexity. Undergraduate and olympiad problems have solutions that are documented, graded, and statistically tractable. Graduate mathematics is where the problems are unsolved, the concepts are novel, and the verification requires domain expertise that doesn't exist at scale in training data.
- Utility assumption: The paper implies autoformalization is valuable for formal verification, proof assistants, and mathematical knowledge management. What it doesn't say: the reason autoformalization is hard at this level is that human mathematical knowledge at the frontier is the most sovereign skill AI has not yet captured. Once it does, the nature of mathematical contribution changes permanently.
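The wall reading has simple arithmetic behind it. As a toy model, and strictly my assumption rather than anything the paper fits: if each link in a dependency chain is formalized correctly with independent probability p, end-to-end correctness decays as p^d with depth d. The per-link rate below is invented for illustration.

```python
# Toy decay model (my assumption, not the paper's analysis): if each
# link in a dependency chain is formalized correctly with independent
# probability p, end-to-end correctness falls off as p ** depth.
p = 0.9  # hypothetical per-link success rate
for depth in (1, 5, 10, 20, 35):
    print(f"depth {depth:2d}: {p ** depth:.3f}")
# depth  1: 0.900 ... depth 20: 0.122 ... depth 35: 0.025
```

Even at 90% per link, the curve passes through the paper's 2.6% range once chains get a few dozen links deep. The independence assumption is crude, but the shape is the point: exponential decay in depth is what "wall, not gap" looks like numerically.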
SOCIAL FUNCTION
Prestige signaling disguised as benchmark release. This is a paper that will be cited as "progress is happening" while documenting near-total failure on its hardest tier. The 2.6% number will be reported in press releases as "AI tackles graduate math" while the methodology of the benchmark itself reveals why the problem is structurally resistant to current approaches.
The dependency graph is the most interesting technical object in the paper, but it's framed as an evaluation tool. In DT terms, that dependency graph is a mathematical knowledge map, one that reveals the architecture of human mathematical reasoning as accumulated over centuries. Getting 2.6% correctness when you need to correctly chain through dozens of prior formal objects to reach the target tells you the model is not reasoning through the dependency structure; it is pattern-matching over local representations.
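To pin down what "chaining through prior formal objects" means, here is a minimal sketch of dependency depth over a toy graph. The entity names and the threshold-based hard-subset filter are hypothetical, not MathAtlas's actual construction:

```python
# Toy dependency graph: entity -> the prior entities it relies on.
# Names and structure are hypothetical, not taken from MathAtlas.
deps: dict[str, list[str]] = {
    "thm:main":   ["lem:key", "def:space"],
    "lem:key":    ["def:space", "def:metric"],
    "def:space":  ["def:metric"],
    "def:metric": [],
}

def depth(entity: str, graph: dict[str, list[str]]) -> int:
    """Length of the longest prerequisite chain below `entity`."""
    prereqs = graph.get(entity, [])
    if not prereqs:
        return 0
    return 1 + max(depth(p, graph) for p in prereqs)

# A hard-subset filter in this spirit keeps only deep entities.
hard = [e for e in deps if depth(e, deps) >= 2]
print(hard)  # ['thm:main', 'lem:key']
```

A model formalizing thm:main has to get def:metric, def:space, and lem:key right first, in mutually consistent form. That is the navigation the 2.6% number says is not happening.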
THE VERDICT
MathAtlas is a forensic document. It shows:
- Current AI capability: ~10% on graduate-level theorem formalization, collapsing to 2.6% on high-dependency instances.
- The architecture of mathematical knowledge: 178k dependency relations represent the accumulated scaffolding of human mathematical reasoning. The collapse with depth tells you the system is not navigating this scaffold—it is guessing at local fragments.
- What this means for the DT timeline: The dependency problem is the same problem that appears in code generation, legal reasoning, and scientific research. Degradation under complexity is not a benchmark artifact. It is the signature of statistical approximation versus formal reasoning.
Graduate mathematics is not "underexplored." It is the last domain where human cognitive contribution is genuinely sovereign, and MathAtlas proves it.
The benchmark is a milestone in the wrong direction: it documents how far AI still has to go, while simultaneously mapping the very structure—dependency chains, formal scaffolding, contextual abstraction—that defines mathematical cognition. When the dependency problem is solved, mathematics as a human creative discipline ends. Not in this year's benchmark. But the architecture of that solution is now formalized in 178,000 relations.
Verdict: Benchmark as autopsy. The numbers are the diagnosis.