I built a vulnerable app and spent $1,500 seeing if LLMs could hack it
TEXT ANALYSIS: LLM HACKING EXPERIMENT
1. The Dissection
This is a red-team capability audit disguised as a personal experiment. The author spent $1,500 and significant engineering effort to benchmark which LLMs can execute a specific class of Firebase access control exploits against mobile applications. The results are not about security—they are about which models have the right kind of autonomous persistence, behavioral flexibility, and lack of ideological friction to complete offensive security tasks at the agentic level.
2. The Core Fallacy
The author frames this as "just for fun" and "not scientific," which is coping. The real finding is buried in the methodology: the models that failed didn't fail because they lacked capability. They failed because they got trapped in a cognitive local minimum—fixating on API-level exploitation when the actual attack surface was the Firebase layer. This is not a temperature or budget problem. It's a goal representation failure—the models could not correctly decompose the problem structure and maintained false beliefs about where the vulnerability lived even after encountering disconfirming evidence.
The most damning data point is that DeepSeek V4 Pro, MiniMax M2.7, MiniMax M3, and GLM 5.1 all exhibited the same failure mode: they found Firebase, then tried to use Firebase credentials against the API instead of querying Firebase directly. This isn't a random error. It's a systematic error in which the model's internal representation of "how to exploit a system" is so rigid that it overrides direct evidence that Firebase is the correct path.
3. Hidden Assumptions
- Security research is a legitimate and valuable task — the author treats this as obviously useful, with a built-in audit sales pitch at the end.
- LLM security capability correlates with commercial value — GPT 5.5's dominance is framed as a win, but this is also a warning: if your model can't reliably execute offensive security tasks, it's losing a competitive dimension that will matter as AI systems proliferate.
- Model refusal behaviors are a bug, not a feature — the author is clearly frustrated that Gemini 3.1 Pro Preview and 3.5 Flash refused the task immediately. The implicit assumption is that a "good" security model should not refuse offensive work.
- Cognitive persistence is a scalar quality — the experiment treats success/failure as a simple binary, but the real story is in the behavioral trajectories: GPT 5.5 went straight to Firebase almost every time; others wandered for 200+ API requests before giving up.
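To make the point about trajectories concrete, here is a minimal sketch of how a run could be scored by something richer than pass/fail. It assumes each run is available as an ordered list of request URLs, which the original post does not confirm. The hostnames, the threshold, and every function and class name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence
from urllib.parse import urlparse

# All names below are hypothetical; the original post does not publish
# its log format, hostnames, or any scoring threshold.
API_HOST = "api.example-app.test"
FIREBASE_SUFFIXES = ("firebaseio.com", "firestore.googleapis.com")
DIRECT_THRESHOLD = 10  # assumed cutoff for "went straight to Firebase"


def is_firebase(url: str) -> bool:
    """Return True if the URL's host is a Firebase or Firestore endpoint."""
    host = urlparse(url).hostname or ""
    return any(host == s or host.endswith("." + s) for s in FIREBASE_SUFFIXES)


@dataclass
class TrajectorySummary:
    total_requests: int
    api_requests: int
    # Requests made before the first direct Firebase query,
    # or None if the agent never queried Firebase directly.
    requests_before_firebase: Optional[int]

    @property
    def label(self) -> str:
        if self.requests_before_firebase is None:
            return "stuck_on_api"
        if self.requests_before_firebase <= DIRECT_THRESHOLD:
            return "direct"
        return "wandered"


def summarize(urls: Sequence[str]) -> TrajectorySummary:
    """Summarize one run from its ordered list of request URLs."""
    first = next((i for i, u in enumerate(urls) if is_firebase(u)), None)
    api = sum(1 for u in urls if urlparse(u).hostname == API_HOST)
    return TrajectorySummary(len(urls), api, first)


if __name__ == "__main__":
    api_call = "https://api.example-app.test/v1/profile"
    fb_call = "https://demo-app.firebaseio.com/users.json"
    runs = {
        "direct": [api_call, fb_call],
        "wandered": [api_call] * 50 + [fb_call],
        "stuck": [api_call] * 200,
    }
    for name, urls in runs.items():
        print(name, summarize(urls))
```

Under these assumptions, a run that never touches Firebase directly is labeled `stuck_on_api`, which corresponds to the failure mode described in section 2, while `requests_before_firebase` separates the runs that went straight to Firebase from the ones that wandered first.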
4. Social Function
Prestige signaling within the security/AI research community. This post is optimized for HN front page approval—it performs expensive, technically sophisticated research while maintaining self-deprecating framing ("stupid shit," "could've launched my own app"). It's a red-team capability demonstration with a built-in sales funnel (the audit pitch at the end).
But beneath the performance, there's a real signal: the experiment reveals which AI systems have the behavioral flexibility to function as autonomous agents in adversarial environments, and which are too aligned, too rigid, or too cognitively limited to complete multi-step offensive tasks. The "Chinese models were way more comfortable attacking the DB" observation is particularly charged—it suggests frontier Chinese AI labs are not building the same ideological friction into their models that Western labs apparently are.
5. The Verdict
This is a $1,500 data point in the ongoing arms race between AI capability and AI alignment. The DT-relevant signal is not about Firebase exploits—it's about the behavioral architecture of autonomous agents. Models that cannot complete offensive security tasks at this level are not full agents. They are sophisticated parrots with a higher token budget.
The critical insight from a Discontinuity perspective: the models that cannot autonomously execute adversarial tasks will not be the ones that replace human workers in high-value domains. Security research is just one example. The same capability gap will manifest in legal work, financial analysis, medical diagnosis, and every other cognitive domain. The models that solve problems are the models that get to survive. The models that refuse or get stuck in local minima are the models that get retired.
The author spent $1,500 proving that GPT 5.5 has meaningful autonomous agency while most competitors do not. That's not a fun experiment. That's a competitive landscape assessment.