Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
- URL: http://arxiv.org/abs/2602.17316v1
- Date: Thu, 19 Feb 2026 12:24:42 GMT
- Title: Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
- Authors: Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser
- Abstract summary: This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary Large Language Models (LLMs). Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results.
- Score: 40.210132040677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.
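The lexical pipeline described in the abstract can be illustrated with a minimal sketch. The synonym table and `lexical_perturb` function below are hypothetical stand-ins, not the authors' implementation, which derives meaning-preserving substitutions from linguistic resources:

```python
import random

# Illustrative synonym table; the paper's pipeline uses linguistically
# principled synonym substitution, not a hand-written dictionary.
SYNONYMS = {
    "rapid": ["swift", "fast"],
    "primary": ["main", "principal"],
    "examine": ["inspect", "study"],
}

def lexical_perturb(text: str, rng: random.Random) -> str:
    """Replace each word that has a listed synonym with a randomly chosen
    one, producing a truth-conditionally equivalent variant of the prompt."""
    out = []
    for token in text.split():
        # Strip simple trailing punctuation so dictionary lookups still match.
        core = token.rstrip(".,?!")
        suffix = token[len(core):]
        if core.lower() in SYNONYMS:
            replacement = rng.choice(SYNONYMS[core.lower()])
            # Preserve the capitalization of the original token.
            if core[0].isupper():
                replacement = replacement.capitalize()
            out.append(replacement + suffix)
        else:
            out.append(token)
    return " ".join(out)

rng = random.Random(0)
print(lexical_perturb("Examine the primary question.", rng))
```

Scoring a model on both the original and the perturbed prompt, then comparing accuracies, gives the kind of robustness measurement the paper reports.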
Related papers
- Same Answer, Different Representations: Hidden instability in VLMs [65.36933543377346]
We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness. We apply this framework to modern Vision Language Models (VLMs) across the SEEDBench, MMMU, and POPE datasets.
arXiv Detail & Related papers (2026-02-06T12:24:26Z)
- On Robustness and Reliability of Benchmark-Based Evaluation of LLMs [6.121856629864516]
The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag. Real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. We systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities.
arXiv Detail & Related papers (2025-09-04T08:43:27Z)
- FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation [24.39952838336609]
FLUKE is a framework for assessing model robustness through systematic minimal variations of test data. We demonstrate FLUKE's utility by evaluating both fine-tuned models and large language models (LLMs) across six diverse NLP tasks.
arXiv Detail & Related papers (2025-04-24T07:12:37Z)
- Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
- ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models [75.05436691700572]
We introduce ExpliCa, a new dataset for evaluating Large Language Models (LLMs) in explicit causal reasoning. We tested seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics. Surprisingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events.
arXiv Detail & Related papers (2025-02-21T14:23:14Z)
- Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts. Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z)
- Forget What You Know about LLMs Evaluations -- LLMs are Like a Chameleon [12.13060272830352]
Large language models (LLMs) often appear to excel on public benchmarks. These high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that distorts benchmark prompts.
arXiv Detail & Related papers (2025-02-11T10:43:36Z)
- RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models [12.112914393948415]
We present RUPBench, a benchmark designed to evaluate large language models (LLMs) across diverse reasoning tasks.
Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning.
By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns.
arXiv Detail & Related papers (2024-06-16T17:26:44Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Semantic Consistency for Assuring Reliability of Large Language Models [9.040736633675136]
Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks. We introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs. We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
arXiv Detail & Related papers (2023-08-17T18:11:33Z)
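A consistency measure of the kind discussed in the last entry can be sketched as a simple majority-agreement ratio over a model's answers to paraphrases of the same question. This is an illustrative formulation, not the metric defined in that paper:

```python
from collections import Counter

def semantic_consistency(answers: list[str]) -> float:
    """Fraction of answers that agree with the most common answer across
    paraphrases of one question; 1.0 means the model is fully consistent."""
    if not answers:
        return 0.0
    # Light normalization so trivially different surface forms still match.
    normalized = [a.strip().lower() for a in answers]
    majority_count = Counter(normalized).most_common(1)[0][1]
    return majority_count / len(normalized)

# Two of three paraphrases elicited the same answer: consistency of 2/3.
print(semantic_consistency(["Paris", "paris", "Lyon"]))
```

Averaging this score over a benchmark's questions yields a single robustness number that can sit alongside ordinary accuracy.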
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.