DACL-RAG: Data Augmentation Strategy with Curriculum Learning for Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2505.10493v2
- Date: Sat, 04 Oct 2025 03:57:52 GMT
- Title: DACL-RAG: Data Augmentation Strategy with Curriculum Learning for Retrieval-Augmented Generation
- Authors: Shaohan Wang, Licheng Zhang, Zheren Fu, Zhendong Mao, Yongdong Zhang,
- Abstract summary: We introduce DACL-RAG, a multi-stage RAG training framework that combines a multi-level Data Augmentation strategy and a multi-stage Curriculum Learning paradigm.<n>Our framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.
- Score: 54.26665681604041
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods typically optimize the retriever or the generator in a RAG system by directly using the top-k retrieved documents. However, two key issues inherent in the training data constrain the effectiveness of this training paradigm: (1) across different queries, the top-k retrieved documents vary greatly in content quality, with some providing valuable knowledge while others lack critical information or are even misleading, and training on such data in a purely random manner may impair the generator's ability to extract key information; (2) for a given query, the limited set of k documents often exhibits low discriminability, and training solely on them makes it difficult for the retriever to learn how to distinguish between relevant and irrelevant documents. To address these issues, we introduce DACL-RAG, a multi-stage RAG training framework that combines a multi-level Data Augmentation strategy with a multi-stage Curriculum Learning paradigm. The data augmentation strategy constructs comprehensive and diverse training sets with controllable difficulty levels through sample evolution, while the curriculum learning paradigm organizes them into progressive stages for training, ensuring stable and consistent improvements, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our DACL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.
Related papers
- Multi-hop Reasoning via Early Knowledge Alignment [68.28168992785896]
Early Knowledge Alignment (EKA) aims to align Large Language Models with contextually relevant retrieved knowledge.<n>EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency.<n>EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models.
arXiv Detail & Related papers (2025-12-23T08:14:44Z) - RouteRAG: Efficient Retrieval-Augmented Generation from Text and Graph via Reinforcement Learning [69.87510139069218]
Retrieval-Augmented Generation (RAG) integrates non-parametric knowledge into Large Language Models (LLMs)<n>Recent progress has advanced text-based RAG to multi-turn reasoning through Reinforcement Learning (RL)<n>We introduce model, an RL-based framework that enables LLMs to perform multi-turn and adaptive graph-text hybrid RAG.
arXiv Detail & Related papers (2025-12-10T10:05:31Z) - MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval [50.30107119622642]
Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data.<n>Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge.<n>MarAG-R1 is a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms.
arXiv Detail & Related papers (2025-10-31T15:51:39Z) - Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards [50.21528417884747]
We introduce Omni-Thinker, a unified reinforcement learning framework that enhances large language models (LLMs) performance across diverse tasks.<n>Our approach enables consistent optimization across task types and scales RL-based training to subjective domains.<n> Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging.
arXiv Detail & Related papers (2025-07-20T01:50:16Z) - LTRR: Learning To Rank Retrievers for LLMs [53.285436927963865]
We show that routing-based RAG systems can outperform the best single-retriever-based systems.<n>Performance gains are especially pronounced in models trained with the Answer Correctness (AC) metric.<n>As part of the SIGIR 2025 LiveRAG challenge, our submitted system demonstrated the practical viability of our approach.
arXiv Detail & Related papers (2025-06-16T17:53:18Z) - KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG [63.82127103851471]
Retrieval-Augmented Generation (RAG) enables large language models to access broader knowledge sources.<n>We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance.<n>We present KARE-RAG, which improves knowledge utilization through three key innovations.
arXiv Detail & Related papers (2025-06-03T06:31:17Z) - Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning [45.10424242207931]
Retrieval-augmented generation (RAG) enhances the text generation capabilities of large language models (LLMs)<n>We introduce a novel method ReasonRAG that automatically constructs RAG-ProGuide, a high-quality dataset providing process-level rewards for query generation, evidence extraction, and answer generation.<n>With the process-level policy optimization, the proposed framework empowers LLMs to autonomously invoke search, generate queries, extract relevant evidence, and produce final answers.
arXiv Detail & Related papers (2025-05-20T08:21:00Z) - Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models [83.8639566087953]
We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components.<n>DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted, progressively improving RAG components.<n>Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning.
arXiv Detail & Related papers (2025-05-05T23:54:53Z) - Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing [4.874077691069634]
RAG has shown strong capability in enhancing language models' knowledge and reducing AI generative hallucinations.<n>Current multi-round RAG systems may continue searching even when enough information has already been retrieved.<n>This paper introduces a new framework, textbfSIM-RAG, to explicitly enhance RAG systems' self-awareness and multi-round retrieval capabilities.
arXiv Detail & Related papers (2025-05-05T17:39:35Z) - Revisiting Robust RAG: Do We Still Need Complex Robust Training in the Era of Powerful LLMs? [69.38149239733994]
We investigate whether complex robust training strategies remain necessary as model capacity grows.<n>We find that as models become more powerful, the performance gains brought by complex robust training methods drop off dramatically.<n>Our findings suggest that RAG systems can benefit from simpler architectures and training strategies as models become more powerful.
arXiv Detail & Related papers (2025-02-17T03:34:31Z) - RAG-Reward: Optimizing RAG with Reward Modeling and RLHF [8.911260109659489]
Retrieval-augmented generation (RAG) enhances Large Language Models (LLMs) with relevant and up-to-date knowledge.<n>The role of reward models in reinforcement learning for optimizing RAG remains underexplored.<n>We introduce textbfRAG-Reward, a framework designed to develop reward models.
arXiv Detail & Related papers (2025-01-22T22:59:19Z) - Fine-Grained Guidance for Retrievers: Leveraging LLMs' Feedback in Retrieval-Augmented Generation [20.420575358183687]
Retrieval-Augmented Generation (RAG) has proven to be an effective method for mitigating hallucination issues inherent in large language models (LLMs)
Previous approaches typically train retrievers based on semantic similarity, lacking optimization for RAG.
We propose a novel framework, FiGRet, which leverages the language capabilities of LLMs to construct examples from a more granular, information-centric perspective.
arXiv Detail & Related papers (2024-11-06T14:42:39Z) - RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards [78.74923079748521]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) by retrieving knowledge from external resources.<n>Current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge.<n>We propose a Differentiable Data Rewards ( DDR) method, which trains RAG systems by aligning data preferences between different RAG modules.
arXiv Detail & Related papers (2024-10-17T12:53:29Z) - RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation [8.377398103067508]
We introduce RAG Foundry, an open-source framework for augmenting large language models for RAG use cases.
RAG Foundry integrates data creation, training, inference and evaluation into a single workflow.
We demonstrate the framework effectiveness by augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG configurations.
arXiv Detail & Related papers (2024-08-05T15:16:24Z) - FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [70.6584488911715]
retrieval-augmented generation (RAG) has attracted considerable research attention.<n>Existing RAG toolkits are often heavy and inflexibly, failing to meet the customization needs of researchers.<n>Our toolkit has implemented 16 advanced RAG methods and gathered and organized 38 benchmark datasets.
arXiv Detail & Related papers (2024-05-22T12:12:40Z) - Retrieval as Attention: End-to-end Learning of Retrieval and Reading
within a Single Transformer [80.50327229467993]
We show that a single model trained end-to-end can achieve both competitive retrieval and QA performance.
We show that end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings.
arXiv Detail & Related papers (2022-12-05T04:51:21Z) - Hybrid Generative-Retrieval Transformers for Dialogue Domain Adaptation [77.62366712130196]
We present the winning entry at the fast domain adaptation task of DSTC8, a hybrid generative-retrieval model based on GPT-2 fine-tuned to the multi-domain MetaLWOz dataset.
Our model uses retrieval logic as a fallback, being SoTA on MetaLWOz in human evaluation (>4% improvement over the 2nd place system) and attaining competitive generalization performance in adaptation to the unseen MultiWOZ dataset.
arXiv Detail & Related papers (2020-03-03T18:07:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.