Related papers: Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)

Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)

URL: http://arxiv.org/abs/2510.04950v1
Date: Mon, 06 Oct 2025 15:50:39 GMT
Title: Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)
Authors: Om Dobariya, Akhil Kumar,
Abstract summary: This study investigates how varying levels of prompt politeness affect model accuracy on multiple-choice questions.<n>We created a dataset of 50 base questions spanning mathematics, science, and history, each rewritten into five tone variants: Very Polite, Polite, Neutral, Rude, and Very Rude.<n> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The wording of natural language prompts has been shown to influence the performance of large language models (LLMs), yet the role of politeness and tone remains underexplored. In this study, we investigate how varying levels of prompt politeness affect model accuracy on multiple-choice questions. We created a dataset of 50 base questions spanning mathematics, science, and history, each rewritten into five tone variants: Very Polite, Polite, Neutral, Rude, and Very Rude, yielding 250 unique prompts. Using ChatGPT 4o, we evaluated responses across these conditions and applied paired sample t-tests to assess statistical significance. Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation. Our results highlight the importance of studying pragmatic aspects of prompting and raise broader questions about the social dimensions of human-AI interaction.

Related papers

Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, LLaMA [0.6263481844384227]
This work proposes a systematic evaluation framework to examine how interaction tone affects model accuracy.<n>We apply this framework to three recently released and widely available large language models: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta)<n>Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks.
arXiv Detail & Related papers (2025-12-14T19:25:20Z)
CAPE: Context-Aware Personality Evaluation Framework for Large Language Models [8.618075786777219]
We propose the first Context-Aware Personality Evaluation framework for Large Language Models (LLMs)<n>Our experiments reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts.<n>Our framework can be applied to Role Playing Agents (RPAs) to better align with human judgments.
arXiv Detail & Related papers (2025-08-28T03:17:47Z)
What Makes a Good Natural Language Prompt? [72.3282960118995]
We conduct a meta-analysis surveying more than 150 prompting-related papers from leading NLP and AI conferences from 2022 to 2025.<n>We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions.<n>We then empirically explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact.
arXiv Detail & Related papers (2025-06-07T23:19:27Z)
Revealing Fine-Grained Values and Opinions in Large Language Models [40.53709870111704]
We analyse a dataset of 156k responses to the 62 propositions of the Political Compass Test (PCT) generated by 6 LLMs using 420 prompt variations.<n>For fine-grained analysis, we propose to identify tropes in the responses: semantically similar phrases that are recurrent and consistent across different prompts.<n>We find that demographic features added to prompts significantly affect outcomes on the PCT, reflecting bias, as well as disparities between the results of tests when eliciting closed-form vs. open domain responses.
arXiv Detail & Related papers (2024-06-27T15:01:53Z)
Are Large Language Models Chameleons? An Attempt to Simulate Social Surveys [1.5727456947901746]
We conducted millions of simulations in which large language models (LLMs) were asked to answer subjective questions. A comparison of different LLM responses with the European Social Survey (ESS) data suggests that the effect of prompts on bias and variability is fundamental.
arXiv Detail & Related papers (2024-05-29T17:54:22Z)
Large Language Models Can Infer Personality from Free-Form User Interactions [0.0]
GPT-4 can infer personality with moderate accuracy, outperforming previous approaches. Results show that the direct focus on personality assessment did not result in a less positive user experience. Preliminary analyses suggest that the accuracy of personality inferences varies only marginally across different socio-demographic subgroups.
arXiv Detail & Related papers (2024-05-19T20:33:36Z)
Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content. We propose a new formulation of prompt diversity, implying a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z)
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models [61.45529177682614]
We challenge the prevailing constrained evaluation paradigm for values and opinions in large language models. We show that models give substantively different answers when not forced. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
arXiv Detail & Related papers (2024-02-26T18:00:49Z)
You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments [37.03210795084276]
We examine whether the current format of prompting Large Language Models elicits responses in a consistent and robust manner. Our experiments on 17 different LLMs reveal that even simple perturbations significantly downgrade a model's question-answering ability. Our results suggest that the currently widespread practice of prompting is insufficient to accurately and reliably capture model perceptions.
arXiv Detail & Related papers (2023-11-16T09:50:53Z)
Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task. Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency. To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z)
Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all. We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored. This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
Toward Human Readable Prompt Tuning: Kubrick's The Shining is a good movie, and a good prompt too? [84.91689960190054]
Large language models can perform new tasks in a zero-shot fashion, given natural language prompts. It is underexplored what factors make the prompts effective, especially when the prompts are natural language.
arXiv Detail & Related papers (2022-12-20T18:47:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.