Towards Reliable Retrieval in RAG Systems for Large Legal Datasets
- URL: http://arxiv.org/abs/2510.06999v1
- Date: Wed, 08 Oct 2025 13:22:20 GMT
- Title: Towards Reliable Retrieval in RAG Systems for Large Legal Datasets
- Authors: Markus Reuter, Tobias Lingenberg, Rūta Liepiņa, Francesca Lagioia, Marco Lippi, Giovanni Sartor, Andrea Passerini, Burcu Sayin
- Abstract summary: Retrieval-Augmented Generation (RAG) is a promising approach to mitigate hallucinations in Large Language Models (LLMs). This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. We investigate a simple and computationally efficient technique which enhances each text chunk with a document-level synthetic summary. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems when applied to large-scale legal document datasets.
- Score: 6.376251215279889
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) is a promising approach to mitigate hallucinations in Large Language Models (LLMs) for legal applications, but its reliability is critically dependent on the accuracy of the retrieval step. This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. In this paper, we address this challenge by first identifying and quantifying a critical failure mode we term Document-Level Retrieval Mismatch (DRM), where the retriever selects information from entirely incorrect source documents. To mitigate DRM, we investigate a simple and computationally efficient technique which we refer to as Summary-Augmented Chunking (SAC). This method enhances each text chunk with a document-level synthetic summary, thereby injecting crucial global context that would otherwise be lost during a standard chunking process. Our experiments on a diverse set of legal information retrieval tasks show that SAC greatly reduces DRM and, consequently, also improves text-level retrieval precision and recall. Interestingly, we find that a generic summarization strategy outperforms an approach that incorporates legal expert domain knowledge to target specific legal elements. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems when applied to large-scale legal document datasets.
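The two ideas in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: `summarize` is a hypothetical placeholder for an LLM summarization call, `split_fixed` uses naive fixed-size character chunks, and `drm_rate` is only one plausible way to quantify DRM (the share of retrieved chunks that come from a document other than the expected one). The paper's exact chunking settings and metric definition are not given in this summary.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str  # identifier of the source document
    text: str    # summary-prefixed text that would be embedded and indexed


def summarize(document: str, max_chars: int = 150) -> str:
    """Hypothetical stand-in for an LLM summarizer: returns a truncated prefix."""
    return document[:max_chars].strip()


def split_fixed(text: str, size: int) -> list[str]:
    """Naive fixed-size character chunking (an assumption, not the paper's splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def summary_augmented_chunks(doc_id: str, document: str, chunk_size: int = 200) -> list[Chunk]:
    """Prepend one document-level summary to every chunk of the document."""
    summary = summarize(document)
    return [
        Chunk(doc_id=doc_id, text=f"{summary}\n\n{piece}")
        for piece in split_fixed(document, chunk_size)
    ]


def drm_rate(retrieved: list[Chunk], gold_doc_id: str) -> float:
    """Fraction of retrieved chunks whose source document is not the expected one."""
    if not retrieved:
        return 0.0
    wrong = sum(1 for chunk in retrieved if chunk.doc_id != gold_doc_id)
    return wrong / len(retrieved)
```

In such a setup, the summary-prefixed `text` of each chunk would be what gets embedded, so every chunk carries some document-level context, and `drm_rate` would be computed per query over the top-k retrieved chunks.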
Related papers
- Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation [61.47019392413271]
WinnowRAG is designed to systematically filter out noisy documents while preserving valuable content. WinnowRAG operates in two stages: In Stage I, we perform query-aware clustering to group similar documents and form distinct topic clusters. In Stage II, we perform winnowing, wherein a critic LLM evaluates the outputs of multiple agents and iteratively separates useful documents from noisy ones.
arXiv Detail & Related papers (2025-11-01T20:08:13Z)
- ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search [69.60882125603133]
We present ReliabilityRAG, a framework for adversarial robustness that explicitly leverages reliability information of retrieved documents. Our work is a significant step towards more effective, provably robust defenses against retrieved corpus corruption in RAG.
arXiv Detail & Related papers (2025-09-27T22:36:42Z)
- Fishing for Answers: Exploring One-shot vs. Iterative Retrieval Strategies for Retrieval Augmented Generation [11.180502261031789]
Retrieval-Augmented Generation (RAG) based on Large Language Models (LLMs) is a powerful solution to understand and query the industry's closed-source documents. However, basic RAG often struggles with complex QA tasks in legal and regulatory domains. We explore two strategies to improve evidence coverage and answer quality.
arXiv Detail & Related papers (2025-09-05T05:44:50Z)
- Tree-Based Text Retrieval via Hierarchical Clustering in RAGFrameworks: Application on Taiwanese Regulations [0.0]
We propose a hierarchical clustering-based retrieval method that eliminates the need to predefine k. Our approach maintains the accuracy and relevance of system responses while adaptively selecting semantically relevant content. Our framework is simple to implement and easily integrates with existing RAG pipelines, making it a practical solution for real-world applications under limited resources.
arXiv Detail & Related papers (2025-06-16T15:34:29Z)
- Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking [58.69615583599489]
Deliberate Thinking based Retriever (Debater) is a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains [32.71308102835446]
Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains. We show that RAG is vulnerable to universal poisoning attacks in medical Q&A. We develop a new detection-based defense to ensure the safe use of RAG.
arXiv Detail & Related papers (2024-09-12T02:43:40Z)
- SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z)
- Grounding Language Model with Chunking-Free In-Context Retrieval [27.316315081648572]
This paper presents a novel Chunking-Free In-Context (CFIC) retrieval approach, specifically tailored for Retrieval-Augmented Generation (RAG) systems.
arXiv Detail & Related papers (2024-02-15T07:22:04Z)
- CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources.
This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z)
- Corrective Retrieval Augmented Generation [36.04062963574603]
Retrieval-augmented generation (RAG) relies heavily on the relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong.
We propose Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation.
CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches.
arXiv Detail & Related papers (2024-01-29T04:36:39Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.