Prompt Politeness Affects LLM Accuracy
URL SCAN: Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)
FIRST LINE: The wording of natural language prompts has been shown to influence the performance of large language models (LLMs), yet the role of politeness and tone remains underexplored.
THE DISSECTION
This is a narrow empirical paper operating at the intersection of social psychology mimicry and computer science output-worship. The authors take 50 multiple-choice questions, rewrite them in five tonal registers (Very Polite → Very Rude), and measure ChatGPT 4o's accuracy across conditions. They find a 4-percentage-point difference (80.8% vs 84.8%) favoring rude prompts. They say the finding raises "broader questions about the social dimensions of human-AI interaction."
THE CORE FALLACY
The category error here is spectacular. The paper treats surface lexical variation on a multiple-choice benchmark as evidence of "social dimensions" in human-AI interaction. It is not. It is evidence that certain word patterns correlate with slightly different token probability distributions in a closed-answer task. Conflating prompt wording with social interaction is the kind of conceptual laundering that makes academic credibility cheap.
The effect size is also being massively oversold. 4 percentage points on 50 questions, one model, one format. On a 50-question set, that gap works out to two questions on average. This is the noise floor of behavioral science applied to a system with known stochastic instability. "Statistically significant" in a p-value sense does not mean "meaningful."
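To put rough numbers on the noise-floor point, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above (50 questions, 80.8% vs 84.8%). It assumes each condition is a single pass over 50 independent questions and treats the two conditions as independent samples. Both are simplifications introduced for illustration: the same questions appear in every condition, and the sketch does not model any repeated runs or paired testing the paper may have used. The function and variable names are hypothetical.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of a proportion p observed over n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


# Figures quoted in the review. N_QUESTIONS = 50 treats each tone condition
# as one pass over the question set (an assumption of this sketch).
N_QUESTIONS = 50
ACC_POLITE = 0.808  # lower accuracy, polite end of the scale
ACC_RUDE = 0.848    # higher accuracy, rude end of the scale

# The gap expressed as a proportion and as a count of questions.
gap = ACC_RUDE - ACC_POLITE
questions_apart = gap * N_QUESTIONS

# Standard error of the difference, treating the two conditions as
# independent samples (a simplification, since the questions are shared).
se_diff = math.sqrt(
    binomial_se(ACC_POLITE, N_QUESTIONS) ** 2
    + binomial_se(ACC_RUDE, N_QUESTIONS) ** 2
)

# How many standard errors wide the observed gap is.
gap_in_se = gap / se_diff

print(f"gap in questions: {questions_apart:.1f}")
print(f"standard error of the difference: {se_diff:.3f}")
print(f"gap measured in standard errors: {gap_in_se:.2f}")
```

Under these assumptions the gap comes out at well under one standard error for a single pass. Repeated runs shrink the sampling noise of the model's own randomness, but they do not add new questions, so the question set itself remains a sample of 50.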
HIDDEN ASSUMPTIONS
- MCQ accuracy as proxy for capability. Multiple-choice questions are the most forgiving test format for pattern-matchers. This tells you nothing about actual reasoning, planning, or action in open-ended domains.
- "Rude" and "Polite" as stable semantic categories. These are surface lexical cues, not actual social postures. A model trained on human text does not experience rudeness; it responds to token co-occurrence patterns that correlate with certain training data distributions.
- Generalizability without replication. One model (4o), one question set, one format. The "newer LLMs respond differently" conclusion is speculative narrative, not demonstrated mechanism.
- The social framing as intellectual inflation. The paper needs this to matter. Without "social dimensions," this is a footnote in a prompt-engineering lab report.
SOCIAL FUNCTION
Prestige signaling. Academic career production masquerading as discovery. The format is classic: a trivial finding inflated by framing into a question of broad significance. "Raises broader questions" is the tell. If the finding were robust, the paper would state the mechanism, not gesture vaguely at implications.
Nerd evasion. The actual hard questions—AI capability trajectories, economic displacement, systemic transition—are nowhere in this paper. It retreats to a safe niche: small empirical curiosity dressed in conceptual relevance. This is how research economies reward people who cannot or will not address the actual questions.
THE VERDICT
From the Discontinuity Thesis lens, this paper is structurally irrelevant. It tells you nothing about the axis that matters: whether AI systems can perform economically necessary labor at scale, whether that capacity displaces human workers, whether the institutional mechanisms exist to manage the transition.
The paper is a curiosity. A prompt-wording study on MCQs. Not wrong, necessarily, but operating on a scale so far from the actual dynamics that its findings have no bearing on the transition question.
The only function this paper serves: It gives researchers something to cite when they need to seem productive while avoiding the actual hard problems of the transition era. It is not a warning. It is not a revelation. It is a research budget consuming itself.
Viability of this research genre: Fragile. As AI capabilities compound, the marginal value of prompt-tone sensitivity studies approaches zero. The researchers are optimizing for academic currency in a landscape where academic currency will be increasingly disconnected from structural relevance.