Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding
- URL: http://arxiv.org/abs/2504.01281v2
- Date: Thu, 03 Apr 2025 01:23:22 GMT
- Title: Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding
- Authors: Sakhinana Sagar Srinivas, Venkataramana Runkana,
- Abstract summary: We present a framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning.<n>Our framework integrates two complementary techniques: Policy-d RetrievalAugmented Generation (PORAG) and Adaptive Token-Layer Attention Scoring (ATLAS)<n>Our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present a comprehensive framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. This approach significantly improves large language models on knowledge-intensive tasks, including opendomain question answering and complex reasoning. Our framework integrates two complementary techniques: Policy-Optimized RetrievalAugmented Generation (PORAG), which optimizes the use of retrieved information, and Adaptive Token-Layer Attention Scoring (ATLAS), which dynamically determines retrieval timing and content based on contextual needs. Together, these techniques enhance both the utilization and relevance of retrieved content, improving factual accuracy and response quality. Designed as a lightweight solution compatible with any Transformer-based LLM without requiring additional training, our framework excels in knowledge-intensive tasks, boosting output accuracy in RAG settings. We further propose CRITIC, a novel method to selectively compress key-value caches by token importance, mitigating memory bottlenecks in long-context applications. The framework also incorporates test-time scaling techniques to dynamically balance reasoning depth and computational resources, alongside optimized decoding strategies for faster inference. Experiments on benchmark datasets show that our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems. This integrated approach advances the development of robust, efficient, and scalable RAG systems across diverse applications.
Related papers
- Context-Guided Dynamic Retrieval for Improving Generation Quality in RAG Models [2.9687381456164004]
It proposes a state-aware dynamic knowledge retrieval mechanism to enhance semantic understanding and knowledge scheduling efficiency.
The proposed structure is thoroughly evaluated across different large models, including GPT-4, GPT-4o, and DeepSeek.
The approach also demonstrates stronger robustness and generation consistency in tasks involving semantic ambiguity and multi-document fusion.
arXiv Detail & Related papers (2025-04-28T02:50:45Z) - Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.<n>We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development.<n>We present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents.<n>We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z) - RAG-KG-IL: A Multi-Agent Hybrid Framework for Reducing Hallucinations and Enhancing LLM Reasoning through RAG and Incremental Knowledge Graph Learning Integration [4.604003661048267]
RAG-KG-IL is a novel multi-agent hybrid framework designed to enhance the reasoning capabilities of Large Language Models.
It integrates Retrieval-Augmented Generation (RAG) and Knowledge Graphs (KGs) with an Incremental Learning (IL) approach.
We evaluate the framework using real-world case studies involving health-related queries.
arXiv Detail & Related papers (2025-03-14T11:50:16Z) - Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control [52.405085773954596]
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to mitigate large language model hallucinations.<n>Existing RAG frameworks often apply retrieval indiscriminately,leading to inefficiencies-over-retrieving.<n>We introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off.
arXiv Detail & Related papers (2025-02-17T18:56:20Z) - Adapting to Non-Stationary Environments: Multi-Armed Bandit Enhanced Retrieval-Augmented Generation on Knowledge Graphs [23.357843519762483]
Recent studies have demonstrated that leveraging the Retrieval-Augmented Generation framework, combined with Knowledge Graphs, robustly enhances the reasoning capabilities of Large language models.<n>We introduce a Multi-objective Multi-Armed Bandit enhanced RAG framework, supported by multiple retrieval methods with diverse capabilities.<n>Our method significantly outperforms baseline methods in non-stationary settings while achieving state-of-the-art performance in stationary environments.
arXiv Detail & Related papers (2024-12-10T15:56:03Z) - LightRAG: Simple and Fast Retrieval-Augmented Generation [12.86888202297654]
Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources.
Existing RAG systems have significant limitations, including reliance on flat data representations and inadequate contextual awareness.
We propose LightRAG, which incorporates graph structures into text indexing and retrieval processes.
arXiv Detail & Related papers (2024-10-08T08:00:12Z) - WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs [10.380692079063467]
We propose WeKnow-RAG, which integrates Web search and Knowledge Graphs into a "Retrieval-Augmented Generation (RAG)" system.
First, the accuracy and reliability of LLM responses are improved by combining the structured representation of Knowledge Graphs with the flexibility of dense vector retrieval.
Our approach effectively balances the efficiency and accuracy of information retrieval, thus improving the overall retrieval process.
arXiv Detail & Related papers (2024-08-14T15:19:16Z) - RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems [51.171355532527365]
Retrieval-augmented generation (RAG) can significantly improve the performance of language models (LMs)
RAGGED is a framework for analyzing RAG configurations across various document-based question answering tasks.
arXiv Detail & Related papers (2024-03-14T02:26:31Z) - DNS-Rec: Data-aware Neural Architecture Search for Recommender Systems [79.76519917171261]
This paper addresses the computational overhead and resource inefficiency prevalent in Sequential Recommender Systems (SRSs)<n>We introduce an innovative approach combining pruning methods with advanced model designs.<n>Our principal contribution is the development of a Data-aware Neural Architecture Search for Recommender System (DNS-Rec)
arXiv Detail & Related papers (2024-02-01T07:22:52Z) - Machine Learning Insides OptVerse AI Solver: Design Principles and
Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problem.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z) - Learning to Continuously Optimize Wireless Resource in a Dynamic
Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures certain fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.