Related papers: Natural Context Drift Undermines the Natural Language Understanding of Large Language Models

Natural Context Drift Undermines the Natural Language Understanding of Large Language Models

URL: http://arxiv.org/abs/2509.01093v1
Date: Mon, 01 Sep 2025 03:32:50 GMT
Title: Natural Context Drift Undermines the Natural Language Understanding of Large Language Models
Authors: Yulong Wu, Viktor Schlegel, Riza Batista-Navarro,
Abstract summary: We propose a framework for curating naturally evolved, human-edited variants of reading passages from contemporary QA benchmarks.<n>Using this framework, we evaluate six QA datasets and eight Large Language Models with publicly available training data.<n>Experiments reveal that LLM performance declines as reading passages naturally diverge from the versions encountered during pretraining.
Score: 17.3648708521395
License: http://creativecommons.org/licenses/by/4.0/
Abstract: How does the natural evolution of context paragraphs affect question answering in generative Large Language Models (LLMs)? To investigate this, we propose a framework for curating naturally evolved, human-edited variants of reading passages from contemporary QA benchmarks and for analyzing LLM performance across a range of semantic similarity scores, which quantify how closely each variant aligns with content seen during pretraining. Using this framework, we evaluate six QA datasets and eight LLMs with publicly available training data. Our experiments reveal that LLM performance declines as reading passages naturally diverge from the versions encountered during pretraining-even when the question and all necessary information remains present at inference time. For instance, average model accuracy on BoolQ drops by over 30% from the highest to lowest similarity bins, with slopes exceeding 70 across several LLMs. These findings suggest that natural text evolution poses a significant challenge to the language understanding capabilities of LLMs.

Related papers

Are the LLMs Capable of Maintaining at Least the Language Genus? [5.748049484273442]
We show that genus-level effects are present but strongly conditioned by training resource availability.<n>Our findings suggest that LLMs encode aspects of genus-level structure, but training data imbalances remain the primary factor shaping their multilingual performance.
arXiv Detail & Related papers (2025-10-24T15:20:40Z)
How Do LLM-Generated Texts Impact Term-Based Retrieval Models? [76.92519309816008]
This paper investigates the influence of large language models (LLMs) on term-based retrieval models.<n>Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes.<n>Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries.
arXiv Detail & Related papers (2025-08-25T06:43:27Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs)<n>We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy.<n>We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.<n>We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.<n>Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z)
CSS: Contrastive Semantic Similarity for Uncertainty Quantification of LLMs [1.515687944002438]
We propose Contrastive Semantic Similarity, a module to obtain similarity features for measuring uncertainty for text pairs. We conduct extensive experiments with three large language models (LLMs) on several benchmark question-answering datasets. Results show that our proposed method performs better in estimating reliable responses of LLMs than comparable baselines.
arXiv Detail & Related papers (2024-06-05T11:35:44Z)
Uncertainty Quantification for In-Context Learning of Large Language Models [52.891205009620364]
In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs) We propose a novel formulation and corresponding estimation method to quantify both types of uncertainties. The proposed method offers an unsupervised way to understand the prediction of in-context learning in a plug-and-play fashion.
arXiv Detail & Related papers (2024-02-15T18:46:24Z)
Pushing The Limit of LLM Capacity for Text Classification [27.684335455517417]
We propose RGPT, an adaptive boosting framework tailored to produce a specialized text classification LLM. We show that RGPT significantly outperforms 8 SOTA PLMs and 7 SOTA LLMs on four benchmarks by 1.36% on average.
arXiv Detail & Related papers (2024-02-12T08:14:03Z)
LLMs May Perform MCQA by Selecting the Least Incorrect Option [29.202758753639078]
Large Language Models (LLMs) have markedly enhanced performance across a variety of tasks.<n>The adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction.<n>However, concerns regarding the robustness of this evaluative method persist.
arXiv Detail & Related papers (2024-02-02T12:07:00Z)
Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models [91.6543868677356]
The evolution of Neural Machine Translation has been influenced by six core challenges. These challenges include domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search. This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models.
arXiv Detail & Related papers (2024-01-16T13:30:09Z)
LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
Alignedcot is an in-context learning technique for invoking Large Language Models. It achieves consistent and correct step-wise prompts in zero-shot scenarios. We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
Using Natural Language Explanations to Improve Robustness of In-context Learning [35.18010811754959]
Large language models (LLMs) can excel in many tasks via in-context learning (ICL) We investigate whether augmenting ICL with natural language explanations (NLEs) improves the robustness of LLMs on adversarial datasets.
arXiv Detail & Related papers (2023-11-13T18:49:13Z)
MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models [17.322480769274062]
Large language models (LLMs) have shown nearly saturated performance on many natural language processing (NLP) tasks. This paper constructs Multiple Sensitive Factors Time QA (MenatQA) with total 2,853 samples for evaluating the time comprehension and reasoning abilities of LLMs.
arXiv Detail & Related papers (2023-10-08T13:19:52Z)
Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.