Related papers: Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance

URL: http://arxiv.org/abs/2502.12459v1
Date: Tue, 18 Feb 2025 02:42:53 GMT
Title: Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance
Authors: Guangxiang Zhao, Saier Hu, Xiaoqi Jian, Jinzhu Wu, Yuhan Wu, Change Jia, Lin Sun, Xiangzheng Zhang
Abstract summary: This paper investigates the fragility of Large Language Models (LLMs) in generalizing to novel inputs. Despite high benchmark scores, LLMs exhibit significant accuracy drops and unexpected biases when faced with minor but content-preserving modifications.
Score: 5.8538128016098225
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper investigates the fragility of Large Language Models (LLMs) in generalizing to novel inputs, specifically focusing on minor perturbations in well-established benchmarks (e.g., slight changes in question format or distractor length). Despite high benchmark scores, LLMs exhibit significant accuracy drops and unexpected biases (e.g., preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B's MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT-4 experiences a 25-point accuracy loss when question types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts. This work aligns with the ACL 2025 theme track on the Generalization of NLP models, proposing a "Generalization Stress Test" to assess performance shifts under controlled perturbations. The study calls for reevaluating benchmarks and developing more reliable evaluation methodologies to capture LLM generalization abilities better.
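As a rough illustration of the kind of controlled perturbation the abstract describes, the sketch below lengthens the distractors of a multiple-choice item without touching the question or the correct option, then reports the accuracy shift in percentage points. It is not the authors' implementation: the `MCQ` type, the filler text, the helper names, and the toy `longest_option_picker` "model" are all assumptions made for this example.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class MCQ:
    """Hypothetical container for one multiple-choice item."""
    question: str
    options: List[str]
    answer_idx: int


def lengthen_distractors(item: MCQ, filler: str = " (as is commonly stated)") -> MCQ:
    """Append neutral filler to every wrong option; the correct option is unchanged."""
    new_options = [
        opt if i == item.answer_idx else opt + filler
        for i, opt in enumerate(item.options)
    ]
    return replace(item, options=new_options)


def accuracy(predict: Callable[[MCQ], int], items: List[MCQ]) -> float:
    """Fraction of items for which predict returns the correct option index."""
    return sum(predict(it) == it.answer_idx for it in items) / len(items)


def stress_test_delta(
    predict: Callable[[MCQ], int],
    items: List[MCQ],
    perturb: Callable[[MCQ], MCQ],
) -> float:
    """Accuracy on perturbed items minus accuracy on originals, in percentage points."""
    perturbed = [perturb(it) for it in items]
    return 100.0 * (accuracy(predict, perturbed) - accuracy(predict, items))


def longest_option_picker(item: MCQ) -> int:
    """Toy stand-in for a model that relies on a surface cue: always pick the longest option."""
    return max(range(len(item.options)), key=lambda i: len(item.options[i]))


# Toy items in which the correct option happens to be the longest one.
items = [
    MCQ("What is the capital of France?", ["Rome", "Paris, France", "Oslo"], 1),
    MCQ("What is 2 + 2?", ["4 exactly", "5", "3"], 0),
]

delta = stress_test_delta(longest_option_picker, items, lengthen_distractors)
print(f"accuracy shift: {delta:+.1f} points")
```

In practice `predict` would wrap a call to a real model. The length-biased picker is used here only because its behavior under this perturbation is easy to reason about.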

Related papers

Evaluating the Sensitivity of LLMs to Prior Context [2.377922603550519]
Large language models (LLMs) are increasingly deployed in multi-turn dialogue and other sustained interactive scenarios. We introduce a novel set of benchmarks that vary the volume and nature of prior context to measure sensitivity to contextual variations. Our findings reveal that LLM performance on multiple-choice questions can degrade dramatically in multi-turn interactions.
arXiv Detail & Related papers (2025-05-29T16:09:32Z)
Debiased Prompt Tuning in Vision-Language Model without Annotations [14.811475313694041]
Vision-Language Models (VLMs) may suffer from the problem of spurious correlations. By leveraging pseudo-spurious attribute annotations, we propose a method to automatically adjust the training weights of different groups. Our approach efficiently improves the worst-group accuracy on CelebA, Waterbirds, and MetaShift datasets.
arXiv Detail & Related papers (2025-03-11T12:24:54Z)
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks [0.9831489366502301]
We introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts. Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish. Results show that all models experience remarkable accuracy drops, with an average loss of 57% on MMLU and 50% on UNED-Access 2024.
arXiv Detail & Related papers (2025-02-18T14:32:44Z)
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon [11.753349115726952]
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that distorts benchmark prompts. By rephrasing inputs while preserving semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns.
arXiv Detail & Related papers (2025-02-11T10:43:36Z)
Train More Parameters But Mind Their Placement: Insights into Language Adaptation with PEFT [0.8702432681310401]
We aim to enhance the generation performance of an LLM by specialising it using unstructured text corpora. We find that increasing the number of trainable parameters leads to better and more robust language adaptation. Although improvements are consistent in 0-shot summarisation, some adapted models struggle with longer context lengths.
arXiv Detail & Related papers (2024-12-17T08:44:00Z)
LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs [69.40865293066885]
We present the first systematic evaluation examining format bias in the performance of large language models (LLMs). We present our empirical format bias evaluation spanning four commonly used categories: multiple-choice question-answer, wrapping, list, and mapping.
arXiv Detail & Related papers (2024-08-16T10:45:45Z)
Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one of the following systematic biases.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts. We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries. Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.532180752491954]
Large Language Models (LLMs) are often described as instances of foundation models that possess strong generalization obeying scaling laws. We demonstrate here a dramatic breakdown of generalization and basic reasoning of all SOTA models claiming strong function. We also observe strong overconfidence in the wrong solutions, expressed in the form of plausible-sounding, explanation-like confabulations.
arXiv Detail & Related papers (2024-06-04T07:43:33Z)
You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments [37.03210795084276]
We examine whether the current format of prompting Large Language Models elicits responses in a consistent and robust manner. Our experiments on 17 different LLMs reveal that even simple perturbations significantly downgrade a model's question-answering ability. Our results suggest that the currently widespread practice of prompting is insufficient to accurately and reliably capture model perceptions.
arXiv Detail & Related papers (2023-11-16T09:50:53Z)
Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment [82.60594940370919]
We propose the FlipFlop experiment to study the multi-turn behavior of Large Language Models (LLMs). We show that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect). We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate sycophantic behavior, reducing performance deterioration by 60%, but not resolve it entirely.
arXiv Detail & Related papers (2023-11-14T23:40:22Z)
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting [68.19544657508509]
Large language models (LLMs) are adopted as a fundamental component of language technologies. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings. We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights.
arXiv Detail & Related papers (2023-10-17T15:03:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.