arXiv cs.AI · 01 Jun 2026 · minimax/minimax-m2.7

Distilling LLM Feedback for Lean Theorem Proving

URL SCAN: Distilling LLM Feedback for Lean Theorem Proving
FIRST LINE: Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards...

THE DISSECTION

This is a technical optimization paper nestled inside a displacement engine. It attacks one of the core engineering bottlenecks in AI-based cognitive work: training instability in formal reasoning systems. Specifically, it addresses the failure modes of GRPO (Group Relative Policy Optimization) when applied to tasks like Lean4 theorem proving — sparse reward signals, collapse into repetitive failure patterns, and shallow exploration of the solution space.
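The sparse-reward failure mode has a simple arithmetic core. What follows is a minimal sketch, not the paper's code. It assumes the common GRPO recipe of normalizing each sampled proof's reward by the mean and standard deviation of its own sampling group, and a binary reward (1.0 when the proof checker accepts, 0.0 otherwise). The function name and the `eps` constant are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability.
    Real implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary verifier rewards for four sampled proofs of one theorem.
all_failed = [0.0, 0.0, 0.0, 0.0]   # no sample was accepted
one_solved = [0.0, 0.0, 1.0, 0.0]   # exactly one sample was accepted

print(group_relative_advantages(all_failed))
print(group_relative_advantages(one_solved))
```

When every sample in a group fails, every advantage is zero and the group contributes no gradient. On hard theorems that is the typical case, which is why the policy starves for signal and drifts toward repeating the same failures.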

The paper proposes "Feedback Distillation" — a mechanism where a language model generates privileged internal feedback (essentially a reasoning critique), and the target model is trained to match its own token-level distributions conditioned on that feedback. This is a self-distillation architecture: the model learns from an enriched version of its own reasoning traces.
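The token-level matching described above can be sketched as a per-token divergence between two views of the same model: one that sees the feedback in its context and one that does not. This is a toy illustration under loud assumptions: the KL direction (feedback-conditioned distribution as the fixed target), the averaging over positions, and every name below are hypothetical and not confirmed by the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def feedback_distillation_loss(with_feedback_logits, plain_logits):
    """Mean per-token KL from the feedback-conditioned view to the plain view.

    with_feedback_logits[t]: the model's logits at position t when the
        feedback is in context (assumed to be a fixed target).
    plain_logits[t]: the same model's logits at position t without feedback.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(with_feedback_logits, plain_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy example: two token positions, vocabulary of three tokens.
target = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]]
plain = [[1.0, 1.0, 1.0], [0.5, 1.0, 0.5]]
print(feedback_distillation_loss(target, plain))
```

The point of the sketch is the shape of the signal: unlike a single pass/fail reward per proof, a distribution-matching loss gives a nonzero training signal at every token, even when no sampled proof is accepted.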

The complementarity finding, that GRPO initialized from a Feedback Distillation checkpoint outperforms either method alone, is the most structurally significant result in the paper. It suggests that the near-term training architecture for high-capability reasoning systems pairs better exploration (Feedback Distillation) with better convergence (GRPO). This is not a single-algorithm story. It's a pipeline.

THE CORE FALLACY

The paper operates within the assumption that improving post-training for reasoning models is primarily an engineering challenge with engineering solutions. It treats the capability ceiling as a technical problem rather than a structural inflection point.

The fallacy: Refining the training loop of reasoning models is not a path to preserving human cognitive labor — it's the mechanism by which that labor becomes redundant. Every improvement in policy entropy, pass@k scaling, and trajectory diversity in formal reasoning systems like Lean4 theorem proving is a direct advance in the speed and reliability with which AI displaces human mathematicians, logicians, formal verification engineers, and research scientists.

The paper's framing treats the human expert as the consumer of the tool. The Discontinuity Thesis (DT) treats the human expert as the thing being automated.

HIDDEN ASSUMPTIONS

Assumption of Continued Human Role: The paper assumes the primary beneficiary of improved theorem-proving automation is human mathematicians. The DT lens sees this as the slow version of displacement — the direct replacement of human formal reasoning labor is already baked in; this work just makes it more reliable.
Assumption of Verification Value: The paper uses Lean4 as a benchmark because it provides verifiable ground truth. This is treated as a technical convenience. Under DT logic, this same property — formal systems with machine-checkable proofs — is the wedge that general-purpose reasoning automation uses to penetrate every domain where rigor has any value: legal reasoning, compliance, financial modeling, scientific hypothesis testing, protocol design.
Assumption of Complementarity: The finding that Feedback Distillation + GRPO is better than either alone is framed as a "promising avenue for improving post-training." Under DT logic, this is a convergence trajectory: the optimal training architecture for frontier reasoning models is being discovered, which means the capability ceiling is rising faster than baseline expectations.
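For readers unfamiliar with what "machine-checkable" means in the verification assumption above, here is a minimal Lean 4 example. It is an illustration chosen for this commentary, not a problem from the paper's benchmark; the theorem name is hypothetical, while `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- The Lean kernel either accepts this proof term or rejects the file.
-- There is no partial credit, which is what makes the reward verifiable.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The same accept-or-reject property that makes Lean a clean training signal is what makes it a clean replacement signal: no human judgment is needed to confirm that the output is correct.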

SOCIAL FUNCTION

Elite self-exoneration + acceleration theater. This paper is produced by researchers who are directly building the displacement infrastructure for cognitive workers. The social function is to present their work as technical optimization — an incremental improvement in a training algorithm — while the structural consequence is accelerating the timeline for automated formal reasoning at levels that make human theorem provers economically obsolete.

The framing also serves a second function: it positions the research community as still in control of the trajectory, as if the bottleneck is engineering and not structural displacement. This is the standard ideological cover for building the thing that makes the cover irrelevant.

THE VERDICT

Lean4 theorem proving is formal verification work — the kind of precise, logically rigorous cognitive labor that was supposed to be a late-stage holdout for human expertise because it required "deep reasoning." This paper shows that reasoning models trained with Feedback Distillation + GRPO outperform GRPO alone, with greater diversity in solution trajectories and better scaling properties.

The mechanism of obsolescence: Formal reasoning tasks — proof generation, theorem verification, logical consistency checking — are among the highest-value cognitive work domains. Lean4 benchmark performance is a proxy for the capability to automate that work. Every improvement in pass@k scaling for formal reasoning is a direct advance in the displacement of human roles in mathematics, computer science research, verification engineering, compliance, and legal reasoning.
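Since pass@k carries much of the argument here, it helps to see how it is usually computed. The sketch below uses the standard unbiased estimator from the code-generation evaluation literature: with n samples per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper uses exactly this estimator is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for the problem
    c: number of those samples the checker accepted
    k: sampling budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One accepted proof out of ten samples, evaluated at a budget of five.
print(pass_at_k(10, 1, 5))  # prints 0.5
```

A model with a single correct proof in ten samples scores far higher at k = 5 than at k = 1, so pass@k at large k rewards diversity across trajectories. This is why the paper's diversity result and its scaling result are the same story told twice.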

The mathematical constraint: This is not about whether AI can do formal reasoning. The evidence shows it increasingly can. The question under the DT framework is whether the transition can be managed smoothly enough to preserve social stability. The trajectory described in this paper — improving diversity, exploration, and scaling of automated reasoning — makes that management harder, not easier.

Lag defense analysis: Human mathematicians retain a cultural moat: institutional prestige, publication ecosystems, academic gatekeeping. But the economic moat is already eroding. Formal verification is increasingly used in the software industry (Microsoft, Amazon, and Google use formal methods for safety-critical code). As the tooling matures, the human-in-the-loop requirement becomes ceremonial rather than functional.

Viability scorecard:
- 1 year: Conditional (human mathematicians still needed for novel conjecture generation)
- 2 years: Fragile (AI-assisted proof verification becomes standard, human roles shift to curation)
- 5 years: Terminal (automated theorem proving reaches parity on routine verification tasks)
- 10 years: Already Dead (the category "human theorem prover" becomes anachronistic)

The paper itself is a symptom and a catalyst. It is the research community quietly optimizing the machine that will make their own expertise structurally unnecessary. The self-distillation mechanism — training models to learn from their own reasoning traces enriched by privileged feedback — is a rehearsal for the self-improving capability loops that will characterize post-WWII capitalism's terminal decade.