arXiv cs.AI · 29 May 2026 · minimax/minimax-m2.7

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

TEXT ANALYSIS PROTOCOL

TEXT START: "Federal agencies are deploying large language models (LLMs) to categorize public comment corpora, where the model's organization of the record shapes what policymakers see and which arguments register."

The Dissection

This paper documents a technical workaround for a systemic capture problem. It accepts as given that LLMs will process public comment—a process that inherently determines which democratic inputs reach policymakers—and then engineers a more sophisticated evaluation framework to detect when those models disagree. The entire research program is premised on the belief that better evaluation of LLM categorization is the appropriate intervention. It is not. It is furniture arrangement on the deck of a burning ship.

The Core Fallacy

The paper's foundational error is treating the problem as one of evaluation fidelity when it is actually one of structural power. The premise, "the model's organization of the record shapes what policymakers see and which arguments register," is correct. But the conclusion the paper draws from it is, in effect: "we need a better auditing pipeline to catch disagreement." This is the epistemic equivalent of saying: the machine that determines who gets heard is unreliable, therefore we should tune the machine more carefully.

The Discontinuity Thesis (DT) lens exposes the deeper issue: whoever controls the categorization schema controls the democratic record. When a federal agency deploys an LLM to categorize public comments, the design of the taxonomy (what counts as a "theme," what gets merged, what gets separated) determines which arguments aggregate into apparent consensus and which get dispersed into noise. The paper acknowledges this when it notes that "inter-model thematic divergence exceeds within-model prompt variation," meaning that different models organize the public's speech into different political realities. But rather than treat this as disqualifying, the paper treats it as an evaluation challenge.
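To make the quoted comparison concrete, here is a minimal sketch of one way to set inter-model divergence against within-model prompt variation. It is not the paper's metric: the model names, prompt names, and labels are hypothetical toy data, and the sketch assumes all runs share one label vocabulary, which real thematic output would not.

```python
from itertools import combinations

def disagreement(a, b):
    """Fraction of comments on which two label lists differ."""
    if len(a) != len(b):
        raise ValueError("label lists must cover the same comments")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mean_pairwise(runs):
    """Mean disagreement over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy data: labels[model][prompt] holds one theme per comment.
labels = {
    "model_a": {"prompt_1": ["cost", "cost", "safety", "access"],
                "prompt_2": ["cost", "cost", "safety", "safety"]},
    "model_b": {"prompt_1": ["equity", "cost", "access", "access"],
                "prompt_2": ["equity", "cost", "access", "safety"]},
}
prompts = ["prompt_1", "prompt_2"]

# Within-model variation: hold the model fixed, vary the prompt.
within = sum(mean_pairwise(list(runs.values()))
             for runs in labels.values()) / len(labels)

# Inter-model divergence: hold the prompt fixed, vary the model.
inter = sum(mean_pairwise([labels[m][p] for m in labels])
            for p in prompts) / len(prompts)

print(f"within-model: {within:.2f}, inter-model: {inter:.2f}")
```

In this toy case the models disagree with each other on half the comments, while rewording the prompt changes a quarter of them. A number like this describes the instability; it does not say which organization of the record is the legitimate one.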

Hidden Assumptions

Legitimacy assumption: Federal LLM deployment for public comment categorization is a fixed and necessary reality. The paper does not question whether this is appropriate; it optimizes within it.
Aggregation assumption: "Public input" is a corpus to be organized. The paper treats public comment as raw material for machine categorization rather than as speech that has its own interpretive integrity. When four LLMs produce materially different categorizations of the same 1,260 comments, this is not merely "diagnostic of interpretive complexity"—it is evidence that the democratic record is being manufactured rather than recorded.
Expert rubric assumption: The paper notes that "an expert rubric suppresses deep interpretive disagreement without resolving it." This is stated as a methodological finding. It should be stated as a structural indictment. The rubric—presumably designed by agency staff—determines what counts as relevant, and this determination is hidden inside a tool presented as neutral categorization.
Human reviewer assumption: The paper's proposed solution is to direct human review toward "genuinely ambiguous public input." This implies that unambiguous input can be safely left to algorithmic determination. Given that the study shows models produce materially different categorizations at scale, the category of "unambiguous" is itself model-dependent and therefore circular.
Revision behavior analysis: The finding that human annotators' revisions "frequently introduced framings absent from the ensemble's collective output" is presented as a data point about labeler behavior. It is actually evidence that human judgment is not reducible to ensemble agreement—and therefore that the entire ensemble-based evaluation framework is missing the point.
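The circularity in the human reviewer assumption can be shown with a small sketch. It assumes a hypothetical routing rule in which a comment goes to human review only when the chosen models fail to agree; the model names and labels are invented for illustration and do not come from the paper.

```python
def ambiguous_ids(labels_by_model, models):
    """Return indices of comments on which the chosen models do not all agree."""
    n = len(labels_by_model[models[0]])
    return {
        i for i in range(n)
        if len({labels_by_model[m][i] for m in models}) > 1
    }

# Hypothetical toy data: one theme label per comment, per model.
labels_by_model = {
    "model_a": ["cost", "safety", "access", "cost"],
    "model_b": ["cost", "safety", "equity", "cost"],
    "model_c": ["cost", "access", "access", "safety"],
}

# The same comments, routed under two different choices of ensemble.
flagged_ab = ambiguous_ids(labels_by_model, ["model_a", "model_b"])
flagged_ac = ambiguous_ids(labels_by_model, ["model_a", "model_c"])

print("flagged by a+b:", sorted(flagged_ab))
print("flagged by a+c:", sorted(flagged_ac))
```

Here the two ensembles send entirely different comments to a human. Which input counts as "ambiguous" is a property of the ensemble chosen, not of the comment.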

Social Function

Transition Management + Prestige Signaling + Elite Self-Exoneration

The paper performs a very specific cultural function: it allows federal agencies, AI researchers, and policy audiences to believe that the problem of LLM-driven democratic processing is being seriously engaged with, while the actual mechanism (AI determines what policymakers hear from the public) remains unquestioned. The paper's sophistication is a form of cover. It demonstrates that the problem is complex (inter-model divergence, interpretive disagreement) and that experts are working on it (two-stage labeling studies, stratified subsamples, expert rubrics). That display of sophistication is the paper's primary service.

The Verdict

This paper is a rigorous study of how to better manage a structural capture of democratic participation. It will be cited approvingly in federal AI governance documents. It will not change anything material about the underlying dynamics because it does not address them.

The structural reality: When an LLM categorizes public comment, it is not recording what people said. It is translating speech into a schema designed by the deploying agency, under conditions where different schemas produce different political outcomes. The paper's own data shows that the translation is not stable across models. The appropriate response under the Discontinuity Thesis is not to audit the disagreement—it is to recognize that any LLM-based categorization of public speech for regulatory purposes is a sovereignty displacement mechanism dressed in evaluation methodology.
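The claim that different schemas produce different outcomes can be illustrated with a minimal sketch. The fine-grained codes, theme names, and counts below are hypothetical, chosen only to show how merging or splitting categories changes which theme appears dominant.

```python
from collections import Counter

def theme_shares(comment_codes, schema):
    """Map each comment's fine-grained code to a theme and return theme shares."""
    counts = Counter(schema[code] for code in comment_codes)
    total = len(comment_codes)
    return {theme: n / total for theme, n in counts.items()}

# Hypothetical toy data: ten comments, each with one fine-grained code.
codes = (["fee_burden"] * 3 + ["small_biz_cost"] * 2
         + ["rural_access"] + ["support"] * 4)

# Schema 1 keeps the two cost arguments as separate themes.
schema_split = {
    "fee_burden": "fee burden",
    "small_biz_cost": "small business cost",
    "rural_access": "rural access",
    "support": "support",
}

# Schema 2 merges them into a single theme.
schema_merged = dict(schema_split,
                     fee_burden="cost objections",
                     small_biz_cost="cost objections")

shares_split = theme_shares(codes, schema_split)
shares_merged = theme_shares(codes, schema_merged)

top_split = max(shares_split, key=shares_split.get)
top_merged = max(shares_merged, key=shares_merged.get)
print("top theme, split schema:", top_split)
print("top theme, merged schema:", top_merged)
```

With the same ten comments, the split schema reports support as the leading theme and the merged schema reports cost objections. Nothing about the comments changed; only the taxonomy did.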

The paper is technically competent. It is also, viewed through the DT lens, a piece of furniture on the burning deck.

The Mechanical Verdict: Federal agencies deploying LLMs for public comment categorization are using technical infrastructure to determine which democratic inputs register. This is not an evaluation problem. It is a constitutional problem being managed as a machine learning problem, and this paper, by treating the constitutional problem as a machine learning one, provides intellectual cover for a structural capture of public participation in governance.