IRB: Automated Generation of Robust Factuality Benchmarks
- URL: http://arxiv.org/abs/2602.08070v1
- Date: Sun, 08 Feb 2026 18:01:43 GMT
- Title: IRB: Automated Generation of Robust Factuality Benchmarks
- Authors: Lam Thanh Do, Bhagyashree Taleka, Hozaifa Ammar Bhutta, Vikram Sharma Mailthody, Kevin Chen-Chuan Chang, Wen-mei Hwu
- Abstract summary: IRB is a framework for automatically generating benchmarks to evaluate the factuality of RAG systems. We utilize IRB to construct a benchmark and evaluate frontier LLMs and retrievers.
- Score: 21.23127687776739
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Static benchmarks for RAG systems often suffer from rapid saturation and require significant manual effort to maintain robustness. To address this, we present IRB, a framework for automatically generating benchmarks to evaluate the factuality of RAG systems. IRB employs a structured generation pipeline utilizing a factual scaffold and an algorithmic scaffold. We utilize IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results demonstrate that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.
Related papers
- Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration [14.88759517020146]
Existing RAG pipelines incur substantial retrieval and generation overhead when applied to large-scale entity matching. We introduce CE-RAG4EM, a cost-efficient RAG architecture that reduces computation through blocking-based batch retrieval and generation.
arXiv Detail & Related papers (2026-02-05T14:33:00Z)
- SSR: Socratic Self-Refine for Large Language Model Reasoning [78.62319252287938]
Socratic Self-Refine (SSR) is a novel framework for fine-grained evaluation and precise refinement of Large Language Models (LLMs). Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines.
arXiv Detail & Related papers (2025-11-13T18:47:07Z)
- Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards [14.535325886547112]
RAG systems are increasingly deployed in high-stakes domains where users expect outputs to be consistent across semantically equivalent queries. Existing systems often exhibit significant inconsistencies due to variability in both the retriever and generator. We introduce a principled evaluation framework that decomposes RAG consistency into retriever-level, generator-level, and end-to-end components.
arXiv Detail & Related papers (2025-10-05T23:14:13Z)
- PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation [15.230902967865925]
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems.
arXiv Detail & Related papers (2025-07-23T16:14:08Z)
- Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)
- Semantic Tokens in Retrieval Augmented Generation [0.0]
I propose a novel Comparative RAG system that introduces an evaluator module to bridge the gap between probabilistic RAG systems and deterministically verifiable responses. This framework paves the way for more reliable and scalable question-answering applications in domains requiring high precision and verifiability.
arXiv Detail & Related papers (2024-12-03T16:52:06Z)
- Toward General Instruction-Following Alignment for Retrieval-Augmented Generation [63.611024451010316]
Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems.
We propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems.
arXiv Detail & Related papers (2024-10-12T16:30:51Z)
- SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance.
We introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization.
We also present ContextualBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- RAGGED: Towards Informed Design of Scalable and Stable RAG Systems [51.171355532527365]
Retrieval-augmented generation (RAG) enhances language models by integrating external knowledge. RAGGED is a framework for systematically evaluating RAG systems.
arXiv Detail & Related papers (2024-03-14T02:26:31Z)
- Ragas: Automated Evaluation of Retrieval Augmented Generation [23.253210019550377]
Ragas is a framework for evaluation of Retrieval Augmented Generation pipelines. RAG systems are composed of a retrieval module and an LLM-based generation module.
arXiv Detail & Related papers (2023-09-26T19:23:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.