Related papers: RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions

RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions

URL: http://arxiv.org/abs/2501.00353v1
Date: Tue, 31 Dec 2024 09:00:51 GMT
Title: RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
Authors: Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, Benyou Wang,
Abstract summary: RAG-Instruct is a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus.<n>We construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks.<n>Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance.
Score: 25.952471869592443
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they only cover limited RAG scenarios. (2) They suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at https://github.com/FreedomIntelligence/RAG-Instruct.

Related papers

MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval [50.30107119622642]
Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data.<n>Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge.<n>MarAG-R1 is a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms.
arXiv Detail & Related papers (2025-10-31T15:51:39Z)
UR$^2$: Unify RAG and Reasoning through Reinforcement Learning [17.319590573147565]
Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) and Reinforcement Learning from Verifiable Rewards (RLVR)<n>We propose UR2 (Unified RAG and Reasoning), a general framework that unifies retrieval and reasoning through reinforcement learning.<n>Experiments across open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks demonstrate that UR$2$ significantly outperforms existing RAG and RL methods.
arXiv Detail & Related papers (2025-08-08T09:33:20Z)
Single LLM, Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization [64.33914369424494]
RoleRAG is a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization.<n>RoleRAG comprises six modules, each handling a specific sub-task within the RAG process.<n>We introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state.
arXiv Detail & Related papers (2025-05-21T12:25:12Z)
RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving [9.962031642362813]
Retrieval-augmented generation (RAG) is emerging as a popular approach for reliable LLM serving. RAG is a structured abstraction that captures the wide range of RAG algorithms. RAGO is a system optimization framework for efficient RAG serving.
arXiv Detail & Related papers (2025-03-18T18:58:13Z)
mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA [78.45521005703958]
multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge. We propose a novel framework called textbfRetrieval-textbfReftextbfAugmented textbfGeneration (mR$2$AG) which achieves adaptive retrieval and useful information localization. mR$2$AG significantly outperforms state-of-the-art MLLMs on INFOSEEK and Encyclopedic-VQA
arXiv Detail & Related papers (2024-11-22T16:15:50Z)
RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards [78.74923079748521]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) Current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge. We propose a Differentiable Data Rewards ( DDR) method, which trains RAG systems by aligning data preferences between different RAG modules.
arXiv Detail & Related papers (2024-10-17T12:53:29Z)
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation [63.611024451010316]
Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems. We propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems.
arXiv Detail & Related papers (2024-10-12T16:30:51Z)
RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation [54.707460684650584]
Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention. Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval Augmented Generation (RAG) RAGLAB is a modular and research-oriented open-source library that reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms.
arXiv Detail & Related papers (2024-08-21T07:20:48Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation [26.158785168036662]
Retrieval-Augmented Generation allows to enhance Large Language Models with external knowledge. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments.
arXiv Detail & Related papers (2024-07-01T09:09:27Z)
RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems [0.0]
Retrieval-Augmented Generation (RAG) has become a standard architectural pattern for domain-specific knowledge into user-facing chat applications. We introduce RAGBench: the first comprehensive, large-scale RAG benchmark dataset of 100k examples. We formalize the TRACe evaluation framework: a set of explainable and actionable RAG evaluation metrics applicable across all RAG domains.
arXiv Detail & Related papers (2024-06-25T20:23:15Z)
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs [18.581518952488093]
Multi-Head RAG (MRAG) is a novel scheme for fetching multi-aspect documents.<n>We show MRAG's design advantages over 18 RAG baselines, empirical improvements of up to 20% in retrieval success ratios.
arXiv Detail & Related papers (2024-06-07T16:59:38Z)
FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [70.6584488911715]
retrieval-augmented generation (RAG) has attracted considerable research attention. Existing RAG toolkits are often heavy and inflexibly, failing to meet the customization needs of researchers. Our toolkit has implemented 16 advanced RAG methods and gathered and organized 38 benchmark datasets.
arXiv Detail & Related papers (2024-05-22T12:12:40Z)
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources. This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.