Why are large language models so terrible at video games?
TEXT ANALYSIS PROTOCOL
TEXT START: "Large language models (LLMs) have improved so quickly that the benchmarks themselves have evolved, adding more complex problems in an effort to challenge the latest models."
1. THE DISSECTION
This is a technical interview piece disguised as a casual examination of an AI "quirk." In reality, it's a deferred confession from the AI research establishment: the impressive demos mask a system that cannot survive in environments it didn't memorize. Togelius is being honest in a way that makes his colleagues uncomfortable. The article is most interesting precisely where it almost accidentally touches the structural truth—and then flinches.
2. THE CORE FALLACY
The framing treats game-playing failure as a benchmark problem or an interface mismatch. The article's closing question—"bad at games in general, or bad at games using interface designed for humans?"—is the right question, but the text never answers it honestly.
Interface is not the constraint. World-grounding is.
Games are not puzzles to decode. They are simulations of physical and strategic reality. The failure isn't that LLMs can't read the pixels. The failure is that LLMs have no model of what the pixels mean as causal consequence. A human sees a pixelated cliff and understands: falling = death = reset. An LLM given the same visual input has only statistical correlations with language about cliffs, falls, and deaths, not an enacted model of physics and consequence.
The interface question is a distraction. Even with perfect API access to game state, an LLM fails because:
- No embodied causal model: It cannot simulate "if I move right, then the projectile moves left, then I die" in a grounded prediction loop.
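- No real-time adaptive feedback: Games require online learning under time pressure. LLMs are batch processors masquerading as real-time systems: their weights are fixed at inference time, so nothing is learned from one failed attempt to the next.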
- No spatial indexing: Game space is not text. Navigating a 2D grid is not the same cognitive task as writing a sentence, and treating it as equivalent is the category error driving the entire research program.
The article gestures at this—"they're separately very bad at spatial reasoning"—but treats it as a training data problem. It's not. It's a fundamental architectural limitation that scaling and fine-tuning cannot dissolve because the problem isn't information, it's grounding.
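To make the first bullet concrete, here is a minimal sketch of what an explicit prediction loop looks like, assuming a hypothetical one-dimensional game in which the projectile moves one cell left per tick. The names `State`, `step`, `rollout`, and `safe_actions` are illustrative inventions, not taken from the article or any game API. The sketch says nothing about how an LLM works internally; it only shows the kind of consequence simulation the argument says is missing.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical toy game: a player and a projectile share one row of cells.
ACTIONS = {"left": -1, "stay": 0, "right": 1}


@dataclass(frozen=True)
class State:
    player_x: int
    projectile_x: int
    width: int = 8
    dead: bool = False


def step(state: State, action: str) -> State:
    """Assumed forward model: the player moves, the projectile moves one cell left."""
    if state.dead:
        return state
    new_player = min(max(state.player_x + ACTIONS[action], 0), state.width - 1)
    new_projectile = state.projectile_x - 1
    # A hit is either landing on the same cell or passing through each other.
    same_cell = new_player == new_projectile
    swapped = state.player_x == new_projectile and new_player == state.projectile_x
    return State(new_player, new_projectile, state.width, dead=same_cell or swapped)


def rollout(state: State, actions: List[str]) -> State:
    """Simulate a sequence of actions and return the resulting state."""
    for action in actions:
        state = step(state, action)
    return state


def safe_actions(state: State) -> List[str]:
    """One-step lookahead: keep only actions whose predicted outcome is survival."""
    return [action for action in ACTIONS if not step(state, action).dead]
```

With the player at cell 2 and the projectile at cell 4, `safe_actions(State(2, 4))` excludes `"right"`, because the simulated next state is a collision: "if I move right, then the projectile moves left, then I die."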
3. HIDDEN ASSUMPTIONS
Three unexamined axioms run through this piece:
- More data and benchmarks will close the gap. Togelius admits the General Video Game AI competition stopped showing progress: same games, same agents, oscillation, no generalization. The article treats this as a historical curiosity, not a structural warning. If benchmarks plateaued before LLMs existed, LLMs will plateau too. The article never draws this line.
- Coding is a "well-behaved game" and therefore a model for AI capability. This is precisely backwards. Coding is well-behaved because it is entirely textual and referential. Code refers to code. Tests refer to code. The entire domain is a closed linguistic loop. This is not evidence of general capability; it's evidence that LLMs are exceptional at textual self-reference, and the article celebrates this as though it weren't a symptom of narrowness.
- Game-playing failure is a current limitation that future models will overcome. The article is suffused with this assumption, framed as "in 2026" as though the calendar is the variable. But Togelius himself notes that AlphaZero required full retraining for each game family, and those games are structurally similar (discrete turn-based, full observability, identical input spaces). The claim that scaling will eventually produce a general game player is an act of faith, not inference.
4. SOCIAL FUNCTION
This article performs reassurance theater for the technically literate. It says: "LLMs seem impressive but actually fail at something a toddler can do—here's why that's fine." The function is to manage cognitive dissonance in audiences who are being told simultaneously that AI is revolutionary and that it cannot navigate a Mario level.
Secondary function: exoneration of the research program. By framing the problem as "games are diverse" and "spatial reasoning isn't in training data," the article implicitly locates the failure in the environment (games are too diverse) rather than in the architecture (LLMs cannot generalize without grounded simulation). This protects the billion-dollar paradigm.
5. THE VERDICT
Game-playing failure in LLMs is not a benchmark problem. It is the proof of the Fundamental Brittleness Theorem: LLMs are interpolative text engines that require the test distribution to be a smoothed subset of the training distribution. Games are designed to violate this assumption: they are adversarial design spaces built by human creativity specifically to defeat pattern-matching.
The article's closing question deserves a direct answer:
Are they bad at games in general, or bad at games using interface designed for humans?
They are bad at games because they are bad at world-modelling through grounded consequence simulation. The human interface is incidental. Remove the interface entirely—give an LLM perfect state access via API—and it still fails, because it cannot simulate the game, only describe it.
This is not a temporary gap. The architecture cannot converge to general game-playing without either:
- (a) embodied experience generating grounded causal models, or
- (b) a fundamental architectural departure from next-token prediction.
Neither is on the current scaling roadmap.
The final irony: Togelius notes that LLMs can generate playable versions of classic games like Asteroids in one prompt. True. But the generated game is a static artifact, not an experience. It works because the "game" collapses to a trivially small state space that fits the training distribution. The moment the game requires a human opponent who adapts, or a level that wasn't in the training data, or real-time physics that weren't scripted, it dies.
This is exactly what the Discontinuity Thesis predicts: systems that are extraordinary at encoding existing human knowledge and catastrophic at novel adaptive behavior in dynamic environments. Games are the laboratory version of economic reality. The failure mode is identical.
RESIDUAL FUNCTION: Useful as a lag indicator. When AI can consistently beat arbitrary games at human speed without task-specific scaffolding, you will know that either (a) a genuine world-model breakthrough has occurred, or (b) the training set has expanded to include every possible game that will ever exist—which is, itself, a form of system death (no novelty remains).
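The criterion above can be written as a checkable predicate. The following is a rough sketch under stated assumptions: the `GameResult` fields, the 90% win-rate threshold, and the 0.25-second-per-action budget standing in for "human speed" are all hypothetical choices, not values from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GameResult:
    """One evaluation episode; every field is an illustrative assumption."""
    game: str
    seen_in_training: bool
    won: bool
    seconds_per_action: float
    used_game_specific_scaffolding: bool


def passes_lag_indicator(
    results: List[GameResult],
    min_win_rate: float = 0.9,             # assumed meaning of "consistently"
    max_seconds_per_action: float = 0.25,  # assumed meaning of "human speed"
) -> bool:
    """True only if held-out games are won quickly and without per-game help."""
    # Only games the agent has never seen count toward the criterion.
    held_out = [r for r in results if not r.seen_in_training]
    if not held_out:
        return False
    # Any task-specific scaffolding disqualifies the run.
    if any(r.used_game_specific_scaffolding for r in held_out):
        return False
    # Every held-out game must be played within the time budget.
    if any(r.seconds_per_action > max_seconds_per_action for r in held_out):
        return False
    win_rate = sum(r.won for r in held_out) / len(held_out)
    return win_rate >= min_win_rate
```

Filtering on `seen_in_training` is the part that matters for the argument: a high win rate on games already in the training set would say nothing about case (a) versus case (b).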