RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
- URL: http://arxiv.org/abs/2602.17366v1
- Date: Thu, 19 Feb 2026 13:49:39 GMT
- Title: RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
- Authors: Yiming Zhang, Siyue Zhang, Junbo Zhao, Chen Zhao
- Abstract summary: Long-tail question answering presents significant challenges for large language models. Retrieval-augmented generation systems have shown great promise in mitigating this limitation. We introduce RPDR, a novel data augmentation framework that selects high-quality, easy-to-learn data.
- Score: 17.510683145248233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality, easy-to-learn training data to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with round-trip prediction to identify easy-to-learn instances, and retriever training on these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestions, demonstrating substantial improvements over existing retrievers such as BM25 and Contriever, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a routing mechanism that dynamically directs queries to specialized retrieval modules to further improve retrieval performance.
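The round-trip selection step described in the abstract can be illustrated with a minimal sketch. All function names and the filtering criteria below are assumptions for illustration, not the paper's actual implementation: a synthetic (question, passage, answer) triple is kept only if the retriever recovers the source passage from the question and a reader model predicts the original answer back from that passage.

```python
def round_trip_select(candidates, retrieve, answer_fn, top_k=5):
    """Keep synthetic triples that survive a round trip (hypothetical sketch).

    candidates: iterable of (question, passage, answer) triples.
    retrieve:   callable (question, k) -> list of passages.
    answer_fn:  callable (question, passage) -> predicted answer string.
    """
    selected = []
    for question, passage, gold in candidates:
        retrieved = retrieve(question, top_k)           # step 1: question -> passages
        if passage not in retrieved:
            continue                                    # source passage not recoverable
        predicted = answer_fn(question, passage)        # step 2: passage -> answer
        if predicted.strip().lower() == gold.strip().lower():
            selected.append((question, passage, gold))  # consistent round trip: keep
    return selected
```

Triples that fail either leg of the round trip are treated as hard or noisy and dropped, so the retriever is trained only on instances it can plausibly learn from.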
Related papers
- ARK: Answer-Centric Retriever Tuning via KG-augmented Curriculum Learning [17.026973494557303]
We propose a novel fine-tuning framework that optimizes the retriever for answer alignment. We first identify high-quality positive chunks by evaluating their sufficiency to generate the correct answer. We then employ a curriculum-based contrastive learning scheme to fine-tune the retriever.
arXiv Detail & Related papers (2025-11-20T13:05:09Z) - MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval [50.30107119622642]
Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge. MARAG-R1 is a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms.
arXiv Detail & Related papers (2025-10-31T15:51:39Z) - Optimizing Retrieval for RAG via Reinforced Contrastive Learning [10.119882685486427]
Retrieval-augmented generation (RAG) is shifting from retrieving information for human users to retrieving contextual knowledge for AI systems. We propose R3, a retrieval framework optimized for RAG through trial-and-feedback reinforced contrastive learning. R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%.
arXiv Detail & Related papers (2025-10-28T17:18:30Z) - Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization [56.97588709890706]
LongMab-PO is a novel framework that generates high-quality and diverse responses for long-context modeling tasks. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs.
arXiv Detail & Related papers (2025-08-19T16:33:55Z) - LTRR: Learning To Rank Retrievers for LLMs [53.285436927963865]
We show that routing-based RAG systems can outperform the best single-retriever-based systems. Performance gains are especially pronounced in models trained with the Answer Correctness (AC) metric. As part of the SIGIR 2025 LiveRAG challenge, our submitted system demonstrated the practical viability of our approach.
arXiv Detail & Related papers (2025-06-16T17:53:18Z) - Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation [77.10390725623125]
Long-form question answering (LFQA) presents unique challenges for large language models. RioRAG is a novel reinforcement learning framework that advances long-form RAG through reinforced informativeness optimization.
arXiv Detail & Related papers (2025-05-27T07:34:41Z) - Pseudo Relevance Feedback is Enough to Close the Gap Between Small and Large Dense Retrieval Models [29.934928091542375]
Scaling dense retrievers to larger large language model (LLM) backbones has been a dominant strategy for improving their retrieval effectiveness. We introduce PromptPRF, a feature-based pseudo-relevance feedback (PRF) framework that enables small LLM-based dense retrievers to achieve effectiveness comparable to much larger models.
arXiv Detail & Related papers (2025-03-19T04:30:20Z) - Dynamic Data Pruning for Automatic Speech Recognition [58.95758272440217]
We introduce Dynamic Data Pruning for ASR (DDP-ASR), which offers fine-grained pruning granularities specifically tailored for speech-related datasets.
Our experiments show that DDP-ASR can save up to 1.6x training time with negligible performance loss.
arXiv Detail & Related papers (2024-06-26T14:17:36Z) - RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation [42.82192656794179]
Large Language Models (LLMs) exhibit remarkable capabilities but are prone to generating inaccurate or hallucinatory responses.
This limitation stems from their reliance on vast pretraining datasets, making them susceptible to errors in unseen scenarios.
Retrieval-Augmented Generation (RAG) addresses this by incorporating external, relevant documents into the response generation process.
arXiv Detail & Related papers (2024-03-31T08:58:54Z) - Learning to Retrieve Passages without Supervision [58.31911597824848]
Dense retrievers for open-domain question answering (ODQA) have been shown to achieve impressive performance by training on large datasets of question-passage pairs.
We investigate whether dense retrievers can be learned in a self-supervised fashion, and applied effectively without any annotations.
arXiv Detail & Related papers (2021-12-14T19:18:08Z)
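Several of the papers above (ARK, R3, and the unsupervised passage retriever) fine-tune dense retrievers with contrastive objectives. As a minimal, dependency-free sketch of the InfoNCE-style loss commonly used for this, assuming query and passage embeddings are given as plain vectors (the function name and interface are illustrative, not taken from any of these papers):

```python
import math


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style contrastive loss for one query (illustrative sketch).

    Scores the positive passage embedding against negatives by dot product,
    then returns the negative log-softmax of the positive: low loss means the
    query embedding is already much closer to the positive than to negatives.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Temperature-scaled similarity scores; the positive comes first.
    scores = [dot(query, positive) / temperature]
    scores += [dot(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all scores.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]
```

Training a retriever then amounts to minimizing this loss (averaged over a batch) with respect to the encoder producing the embeddings; in-batch negatives are the usual way to populate `negatives` cheaply.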
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.