arXiv cs.AI · 26 May 2026 · minimax/minimax-m2.7

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

URL SCAN: arXiv cs.AI — "From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems"

FIRST LINE: "Deploying machine learning in regulated financial environments — credit risk, fraud detection, and anti-money laundering — exposes critical vulnerabilities in algorithmic reproducibility."

THE DISSECTION

This is a technical survey cataloging a specific failure mode in financial AI: the inability to produce identical outputs given identical inputs across three dominant AI modalities. The authors document that deep neural networks, GNNs, and LLMs introduce nondeterminism through hardware architecture, stochastic sampling, parallel computation, and temporal drift. They propose metrics (RBO, D_cos, TDI, PSD) to measure and audit this instability.
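One of the sources listed above, parallel computation, can be illustrated with a minimal sketch. It assumes only that a parallel reduction may add the same numbers in a different order from run to run; because floating-point addition is not associative, the order alone can change the result. The function names, the contribution values, and the approval threshold are hypothetical and are not taken from the paper.

```python
def sum_in_order(values):
    """Accumulate values strictly left to right, one addition at a time."""
    total = 0.0
    for v in values:
        total += v
    return total


def approve(score, threshold):
    """Toy decision rule (hypothetical): approve when the score exceeds the cutoff."""
    return score > threshold


# Hypothetical per-feature score contributions for one applicant.
contributions = [0.1, 0.2, 0.3]

# Hypothetical approval cutoff.
THRESHOLD = 0.6

# Same inputs, two accumulation orders, as two differently scheduled
# reductions might produce.
score_forward = sum_in_order(contributions)        # (0.1 + 0.2) + 0.3
score_reverse = sum_in_order(contributions[::-1])  # (0.3 + 0.2) + 0.1

print(score_forward, approve(score_forward, THRESHOLD))
print(score_reverse, approve(score_reverse, THRESHOLD))
```

The two scores differ only in the last bit, yet they fall on opposite sides of the cutoff. Real systems accumulate far more terms, so this is a sketch of the mechanism and not a measurement of how often it matters in practice.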

The paper's function: it is a compliance engineering document, apparently written by researchers who work inside or adjacent to regulated financial institutions (banks, insurers, payment processors). This is infrastructure documentation for a system under regulatory siege. It is operational, not theoretical.

THE CORE FALLACY (relative to DT mechanics)

The authors treat nondeterminism as a problem to be solved — an engineering defect to be patched with better metrics, audit frameworks, and evaluation protocols. This is the core delusion.

Under the Discontinuity Thesis (DT), nondeterminism in financial AI is not a bug requiring auditability solutions. It is a feature of the transition, and the regulatory pressure to "fix" it reveals something structurally important: the entire regulatory framework for financial AI is premised on a world where human auditors can understand, reproduce, and verify decisions. That world is ending.

The paper assumes the goal is to make AI systems auditable by human regulators. But the deeper trajectory is that the systems are becoming too complex, too parallel, and too generative for human audit to function as a meaningful check. The authors are essentially writing SOPs for the Titanic's engineering crew when the hazard is built into the ship itself.

HIDDEN ASSUMPTIONS

- Human audit as viable endpoint. The entire framework assumes that adding metrics and evaluation layers produces meaningful oversight. It does not. It produces legibility theater: the appearance of control without the substance.
- Determinism as a desirable property. The paper treats nondeterminism as a failure. Under DT logic, the drift, flip rates, and trajectory divergence they're documenting are symptoms of systems that are too powerful and too complex to be controlled by the humans who deployed them. The "fix" they propose is a containment strategy for a force that has already escaped its container.
- Regulatory compliance as the resolution mechanism. They assume regulated financial institutions will absorb these audit frameworks and that compliance equals safety. Compliance in regulated finance has always been about liability management, not system truth. When the AI makes a decision no human can reproduce, the audit trail is not a safety mechanism; it is a litigation buffer.
- The three modalities are separate problems. The authors treat tabular, graph, and LLM modalities as distinct challenges requiring distinct metrics. They are not. They are three faces of the same underlying phenomenon: cognitive automation that exceeds the causal transparency humans require for trust and accountability.

WHAT THE PAPER IS REALLY DOING

This is a professional transition document. The authors are doing the following simultaneously:

- Signaling competence to financial institutions that need to deploy increasingly powerful AI but face regulatory pushback.
- Creating billable work: the layered evaluation framework with modality-specific metrics is a roadmap for consulting engagements, compliance software, and institutional hiring.
- Signaling to regulators that the industry is "handling" the problem, buying time before the next wave of AI deployment makes the entire auditability framework obsolete.

The authors measured:
- Explanation rank instability in credit scoring (tabular)
- Prediction flip rates in GNN fraud detection (graph)
- Tensor-parallel-induced output divergence in LLM entity extraction (LLM)
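The three measurements above can be sketched in a few lines. This is a rough illustration and not the paper's evaluation code: it assumes that RBO refers to rank-biased overlap (shown here in its simple truncated form) and that D_cos refers to cosine distance, since this review does not reproduce the paper's definitions. TDI and PSD are left out for the same reason. All function names and sample inputs are hypothetical.

```python
import math


def flip_rate(preds_a, preds_b):
    """Fraction of items whose predicted label differs between two runs."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("need two non-empty prediction lists of equal length")
    flips = sum(a != b for a, b in zip(preds_a, preds_b))
    return flips / len(preds_a)


def rbo_truncated(rank_a, rank_b, p=0.9):
    """Truncated rank-biased overlap of two rankings (no extrapolation).

    At each depth d, the agreement is the overlap of the two top-d prefixes
    divided by d; agreements are weighted by p**(d-1). Identical rankings of
    length k score 1 - p**k, not 1, because the tail is cut off.
    """
    depth = min(len(rank_a), len(rank_b))
    seen_a, seen_b = set(), set()
    weighted = 0.0
    for d in range(1, depth + 1):
        seen_a.add(rank_a[d - 1])
        seen_b.add(rank_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        weighted += p ** (d - 1) * agreement
    return (1 - p) * weighted


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical outputs from two runs on identical inputs.
run1_labels = [1, 0, 1, 1]
run2_labels = [1, 1, 1, 0]
run1_features = ["income", "debt", "age"]   # explanation ranking, run 1
run2_features = ["debt", "income", "age"]   # explanation ranking, run 2

print(flip_rate(run1_labels, run2_labels))
print(rbo_truncated(run1_features, run2_features, p=0.5))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

A flip rate of zero, a truncated RBO close to its maximum, and a cosine distance near zero would all indicate that two runs agree; how far from those values a deployed system may drift before it counts as non-reproducible is a policy choice that this sketch does not settle.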

These are not academic curiosities. They are the operational manifestation of systems whose behavior shifts faster than it can be monitored.

THE VERDICT

This paper documents a critical failure mode in financial AI with technical precision. Its proposed solutions — metrics, audit frameworks, evaluation layers — are sophisticated but structurally inadequate. They treat a thermodynamic problem (the irreducible complexity of large-scale cognitive automation) as an engineering problem (improved documentation and testing protocols).

The financial sector is attempting to bolt auditability onto AI systems that were designed without auditability as a constraint. This is equivalent to installing a smoke detector in a building that has already burned down.

The paper's value to the DT framework: It provides empirical confirmation that nondeterminism is not an edge case but a feature of the dominant AI architectures now deployed in credit, fraud, and AML, the highest-stakes financial decisions affecting the largest number of people. The explanation rank instability in credit scoring means that the reasons given for a loan decision can shift based on hardware state, not just input data, and the prediction flip rates in GNN fraud detection mean that the decision itself can flip the same way. This is not a technical defect. This is a structural inversion of accountability.

The Social Function: Lullaby + compliance theater + institutional billable hours generator. The authors are doing important technical work, but they are operating under the assumption that the system they're documenting can be reformed. It cannot. It can only be survived.