When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
- URL: http://arxiv.org/abs/2508.11383v1
- Date: Fri, 15 Aug 2025 10:32:50 GMT
- Title: When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
- Authors: Mikhail Seleznyov, Mikhail Chaichuk, Gleb Ershov, Alexander Panchenko, Elena Tutubalina, Oleg Somov,
- Abstract summary: We present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework.<n>We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset.
- Score: 55.20230501807337
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models' current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters.
Related papers
- Large Language Model enabled Mathematical Modeling [2.132096006921049]
This research investigates the potential of Large Language Models (LLMs) to bridge the formulation gap using natural language understanding and code generation.<n>DeepSeek-R1 is a cost-efficient and high-performing model trained with reinforcement learning.<n>Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies.
arXiv Detail & Related papers (2025-10-22T17:41:42Z) - MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics [72.00014675808228]
Instability in Large Language Models evaluation process obscures true learning dynamics.<n>We introduce textbfMaP, a framework that integrates underlineMerging underlineand the underlinePass@k metric.<n>Experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent rankings.
arXiv Detail & Related papers (2025-10-10T11:40:27Z) - CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation [22.06897150825726]
We propose CODE-DITING, a novel code evaluation method that balances accuracy, efficiency and explainability.<n>We develop a data distillation framework that effectively transfers reasoning capabilities from DeepSeek-R1671B to our CODE-DITING 1.5B and 7B models.<n>With the majority vote strategy in the inference process, CODE-DITING 1.5B outperforms all models with the same magnitude of parameters.
arXiv Detail & Related papers (2025-05-26T04:29:14Z) - Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification.<n>We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z) - FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation [21.850854237079595]
FLUKE is a task-agnostic framework for assessing model robustness through systematic minimal variations of test data.<n>We demonstrate FLUKE's utility by evaluating both fine-tuned models and large language models (LLMs) across four diverse NLP tasks.
arXiv Detail & Related papers (2025-04-24T07:12:37Z) - Breaking Focus: Contextual Distraction Curse in Large Language Models [68.4534308805202]
We investigate a critical vulnerability in Large Language Models (LLMs)<n>This phenomenon arises when models fail to maintain consistent performance on questions modified with semantically coherent but irrelevant context.<n>We propose an efficient tree-based search methodology to automatically generate CDV examples.
arXiv Detail & Related papers (2025-02-03T18:43:36Z) - Robustness of Large Language Models Against Adversarial Attacks [5.312946761836463]
We present a comprehensive study on the robustness of GPT LLM family.<n>We employ two distinct evaluation methods to assess their resilience.<n>Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks.
arXiv Detail & Related papers (2024-12-22T13:21:15Z) - A NotSo Simple Way to Beat Simple Bench [0.0]
This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs)<n>We propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness.<n>Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits exploratory creativity but struggles with ambiguous prompts.
arXiv Detail & Related papers (2024-12-12T16:04:31Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests.<n>First, existing methods often use coarse-grained taxonomy of unsafe topics, and are over-representing some fine-grained topics.<n>Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
We introduce UBench, a new benchmark for evaluating the uncertainty of large language models (LLMs)<n>Unlike other benchmarks, UBench is based on confidence intervals. It encompasses 11,978 multiple-choice questions spanning knowledge, language, understanding, and reasoning capabilities.<n>Our analysis reveals several crucial insights: 1) Our confidence interval-based methods are highly effective for uncertainty quantification; 2) Regarding uncertainty, outstanding open-source models show competitive performance versus closed-source models; 3) CoT and RP prompts present potential ways to improve model reliability, while the influence of temperature changes follows no universal rule.
arXiv Detail & Related papers (2024-06-18T16:50:38Z) - Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results.
We develop a method that avoids introducing external resources, relying instead on perturbations to the input.
Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
arXiv Detail & Related papers (2024-03-04T16:21:54Z) - RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning [45.39445027132887]
We introduce RoTBench, a benchmark for evaluating the robustness of Large Language Models in tool learning.
Experiments involving six widely-used models underscore the urgent necessity for enhancing the robustness of LLMs in tool learning.
We propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning.
arXiv Detail & Related papers (2024-01-16T12:45:15Z) - Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis,
and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.