Related papers: Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems

URL: http://arxiv.org/abs/2510.13975v1
Date: Wed, 15 Oct 2025 18:02:30 GMT
Title: Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems
Authors: Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Rose, Jesse C. Cresswell
Abstract summary: Retrieval-augmented generation (RAG) is a prevalent approach for building question-answering systems. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them.
Score: 10.899541303791928
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at https://github.com/layer6ai-labs/rag-error-classification.

Related papers

RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation [1.564663326217051]
RAGVUE is a framework for evaluation of Retrieval-Augmented Generation (RAG) systems. It decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. RAGVUE supports both manual metric selection and fully automated agentic evaluation.
arXiv Detail & Related papers (2025-12-03T07:42:49Z)
Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries [53.99620546358492]
Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. Existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions. We present the first pipeline for automatic, difficulty-controlled creation of uncheatable, realistic, unanswerable, and multi-hop queries.
arXiv Detail & Related papers (2025-10-13T21:38:04Z)
Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z)
Towards Automated Error Discovery: A Study in Conversational AI [48.735443116662026]
We introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI. We also propose SEEED (Soft Clustering Extended-Based Error Detection), an encoder-based approach to its implementation.
arXiv Detail & Related papers (2025-09-13T14:53:22Z)
DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router [57.28685457991806]
DeepSieve is an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design.
arXiv Detail & Related papers (2025-07-29T17:55:23Z)
Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation [73.9145653659403]
We show that Generative Error Correction models struggle to generalize beyond the specific types of errors encountered during training. We propose DARAG, a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios. Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
Mitigating the Impact of Labeling Errors on Training via Rockafellian Relaxation [0.8741284539870512]
We propose a new loss-reweighting, architecture-independent methodology, the Rockafellian Relaxation Method (RRM), for neural network training. Experiments indicate RRM can enhance neural network methods to achieve robust performance across classification tasks in computer vision and natural language processing (sentiment analysis). We find that RRM can mitigate the effects of dataset contamination stemming from (heavy) labeling error and/or adversarial perturbation, demonstrating effectiveness across a variety of data domains and machine learning tasks.
arXiv Detail & Related papers (2024-05-30T23:13:01Z)
Seven Failure Points When Engineering a Retrieval Augmented Generation System [1.8776685617612472]
RAG systems aim to reduce the problem of hallucinated responses from large language models. RAG systems suffer from limitations inherent to information retrieval systems. We present an experience report on the failure points of RAG systems from three case studies.
arXiv Detail & Related papers (2024-01-11T12:04:11Z)
Discovering and Validating AI Errors With Crowdsourced Failure Reports [10.4818618376202]
We introduce crowdsourced failure reports, end-user descriptions of how or why a model failed, and show how developers can use them to detect AI errors. We also design and implement Deblinder, a visual analytics system for synthesizing failure reports. In semi-structured interviews and think-aloud studies with 10 AI practitioners, we explore the affordances of the Deblinder system and the applicability of failure reports in real-world settings.
arXiv Detail & Related papers (2021-09-23T23:26:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.