Out of Style: RAG's Fragility to Linguistic Variation
- URL: http://arxiv.org/abs/2504.08231v1
- Date: Fri, 11 Apr 2025 03:30:26 GMT
- Title: Out of Style: RAG's Fragility to Linguistic Variation
- Authors: Tianyu Cao, Neel Bhandari, Akhila Yerukola, Akari Asai, Maarten Sap
- Abstract summary: User queries exhibit greater linguistic variations and can trigger cascading errors across interdependent RAG components. We analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impacts RAG performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the impressive performance of Retrieval-augmented Generation (RAG) systems across various NLP benchmarks, their robustness in handling real-world user-LLM interaction queries remains largely underexplored. This presents a critical gap for practical deployment, where user queries exhibit greater linguistic variations and can trigger cascading errors across interdependent RAG components. In this work, we systematically analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impact RAG performance. We evaluate two retrieval models and nine LLMs, ranging from 3 to 72 billion parameters, across four information-seeking Question Answering (QA) datasets. Our results reveal that linguistic reformulations significantly impact both retrieval and generation stages, leading to a relative performance drop of up to 40.41% in Recall@5 scores for less formal queries and 38.86% in answer match scores for queries containing grammatical errors. Notably, RAG systems exhibit greater sensitivity to such variations compared to LLM-only generations, highlighting their vulnerability to error propagation due to linguistic shifts. These findings highlight the need for improved robustness techniques to enhance reliability in diverse user interactions.
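The abstract's headline numbers are relative drops in Recall@5 and answer-match scores between original and reformulated queries. A minimal sketch of that measurement, with illustrative function and variable names (not the authors' actual code), might look like:

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of gold documents found in the top-k retrieved list."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(gold_ids)) / len(gold_ids)

def relative_drop(baseline, variant):
    """Relative performance drop (%) of a variant score vs. a baseline score."""
    return 100.0 * (baseline - variant) / baseline

# Toy example: one query with two gold documents, retrieved twice --
# once for the original query and once for an informal reformulation.
gold = ["d1", "d2"]
orig_ranking = ["d1", "d9", "d2", "d4", "d7"]    # original (formal) query
reform_ranking = ["d9", "d1", "d4", "d7", "d8"]  # informal rewrite

r_orig = recall_at_k(orig_ranking, gold)      # both gold docs in top 5 -> 1.0
r_reform = recall_at_k(reform_ranking, gold)  # only d1 in top 5 -> 0.5
drop = relative_drop(r_orig, r_reform)        # 50.0% relative drop
```

The paper aggregates scores of this kind over four QA datasets and reports the worst-case relative drops (e.g. 40.41% in Recall@5 for less formal queries).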
Related papers
- Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations [65.11348389219887]
We introduce Dialectic-RAG (DRAG), a modular approach that evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives.
We show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models.
arXiv Detail & Related papers (2025-04-07T06:55:15Z)
- QE-RAG: A Robust Retrieval-Augmented Generation Benchmark for Query Entry Errors [23.225358970952197]
Retriever-augmented generation (RAG) has become a widely adopted approach for enhancing the factual accuracy of large language models (LLMs). QE-RAG is the first robust RAG benchmark designed specifically to evaluate performance against query entry errors. We propose a contrastive learning-based robust retriever training method and a retrieval-augmented query correction method.
arXiv Detail & Related papers (2025-04-05T05:24:08Z)
- AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification [25.27444694706659]
We present AskToAct, which exploits structural mapping between queries and their tool invocation solutions. Our key insight is that tool parameters naturally represent explicit user intents. By systematically removing key parameters from queries while retaining them as ground truth, we enable automated construction of high-quality training data.
arXiv Detail & Related papers (2025-03-03T12:55:49Z)
- Beyond Words: How Large Language Models Perform in Quantitative Management Problem-Solving [0.0]
This study examines how Large Language Models (LLMs) perform when tackling quantitative management decision problems in a zero-shot setting. We analyzed 900 responses generated by five leading models across 20 diverse managerial scenarios.
arXiv Detail & Related papers (2025-02-23T12:39:39Z)
- On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
- RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners [38.30539869264287]
Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks.
However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes.
We introduce RankPrompt, a new prompting method that enables LLMs to self-rank their responses without additional resources.
arXiv Detail & Related papers (2024-03-19T02:34:18Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt, through their generative capability, can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z)
- Retrieval-based Disentangled Representation Learning with Natural Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within the data through their natural language counterparts, thus achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.