CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
- URL: http://arxiv.org/abs/2511.18659v2
- Date: Tue, 25 Nov 2025 22:02:20 GMT
- Title: CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
- Authors: Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang
- Abstract summary: CLaRa is a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. Experiments show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
- Score: 34.38636514331703
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
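The load-bearing mechanism in this abstract is that gradients from a single language-modeling loss reach the reranker through a discrete top-k selection. Below is a minimal PyTorch sketch of that idea, assuming an iterative-softmax relaxation of top-k; `SoftTopK`, `ClaraSketch`, the module choices, and all dimensions are illustrative assumptions, not the authors' implementation (the abstract does not say which differentiable top-k estimator is used).

```python
# Minimal sketch of joint reranker+generator training through a
# differentiable top-k, in the spirit of the CLaRa abstract. All names,
# dimensions, and the specific relaxation are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftTopK(nn.Module):
    """Relaxed top-k: k softmax rounds, down-weighting already-picked items."""
    def __init__(self, k: int, tau: float = 0.5):
        super().__init__()
        self.k, self.tau = k, tau

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        weights = torch.zeros_like(scores)
        logits = scores / self.tau
        for _ in range(self.k):
            p = F.softmax(logits, dim=-1)
            weights = weights + p
            # suppress mass already assigned so the next round picks a new item
            logits = logits + torch.log1p(-p.clamp(max=1.0 - 1e-6))
        return weights.clamp(max=1.0)  # soft indicator of the selected set

class ClaraSketch(nn.Module):
    def __init__(self, hidden: int = 64, vocab: int = 1000, k: int = 2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, hidden)
        self.compress = nn.Linear(hidden, hidden)    # stands in for embedding-based compression
        self.score = nn.Bilinear(hidden, hidden, 1)  # reranker: query-passage relevance
        self.topk = SoftTopK(k)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, query_emb, passage_embs, answer_ids):
        z = self.compress(passage_embs)                        # (D, H) compressed vectors
        s = self.score(query_emb.expand_as(z), z).squeeze(-1)  # (D,) relevance scores
        w = self.topk(s)                                       # (D,) soft top-k weights
        ctx = (w.unsqueeze(-1) * z).sum(dim=0, keepdim=True)   # selected context, (1, H)
        inp = self.tok_emb(answer_ids[:, :-1])                 # teacher forcing
        out, _ = self.decoder(inp, ctx.unsqueeze(0))           # context as initial state
        logits = self.lm_head(out)
        # one LM loss; its gradient reaches self.score via the soft weights w
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               answer_ids[:, 1:].reshape(-1))

model = ClaraSketch()
loss = model(torch.randn(1, 64), torch.randn(8, 64), torch.randint(0, 1000, (1, 12)))
loss.backward()  # reranker parameters receive gradient through the relaxed top-k
```

Gumbel-perturbed or straight-through estimators are common alternatives to this relaxation; whichever is used, the effect is the same: answer-quality gradients adjust retrieval scores directly, which is the alignment the abstract claims theoretically.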
Related papers
- Latent Context Compilation: Distilling Long Context into Compact Portable Memory [13.768393657432027]
We propose Latent Context Compilation, a framework that shifts context processing from adaptation to compilation. By utilizing a disposable LoRA module as a compiler, we distill long contexts into compact buffer tokens. Experiments with Llama-3.1-8B demonstrate that Latent Context Compilation preserves fine-grained details and reasoning capabilities.
arXiv Detail & Related papers (2026-01-31T08:38:07Z)
- GrepRAG: An Empirical Study and Optimization of Grep-Like Retrieval for Code Completion [32.17127975368661]
Repository-level code completion remains challenging for large language models. We investigate lightweight, index-free, intent-aware lexical retrieval. We introduce Naive GrepRAG, a baseline framework in which LLMs autonomously generate ripgrep commands to retrieve relevant context (a minimal sketch appears after this list).
arXiv Detail & Related papers (2026-01-30T18:22:15Z)
- Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost. This work rethinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation. Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z)
- RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse [39.76548092849437]
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse.
arXiv Detail & Related papers (2025-11-05T13:59:01Z)
- CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling [52.05149789178508]
CCF is a novel context compression framework designed to enable efficient long-context modeling. CCF integrates segment-wise semantic aggregation with key-value memory encoding, forming compact representations. Empirical results on multiple long-context language modeling benchmarks demonstrate that CCF achieves competitive perplexity under high compression ratios.
arXiv Detail & Related papers (2025-09-11T07:13:49Z)
- Retrieval-augmented reasoning with lean language models [5.615564811138556]
We develop a retrieval-augmented conversational agent capable of interpreting complex, domain-specific queries. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models. All implementation details and code are publicly released to support adaptation across domains.
arXiv Detail & Related papers (2025-08-15T10:38:15Z)
- Scalable In-Context Q-Learning [68.9917436397079]
We propose Scalable In-Context Q-Learning (SICQL) to steer in-context reinforcement learning. SICQL harnesses dynamic programming and world modeling to steer ICRL toward efficient reward and task generalization.
arXiv Detail & Related papers (2025-06-02T04:21:56Z)
- Chain-of-Retrieval Augmented Generation [91.02950964802454]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state (a loop-style sketch follows this list).
arXiv Detail & Related papers (2025-01-24T09:12:52Z)
- Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks [11.053340674721005]
Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. This paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval.
arXiv Detail & Related papers (2024-12-20T06:58:32Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token [108.7069350303884]
xRAG is an innovative context compression method tailored for retrieval-augmented generation. xRAG seamlessly integrates document embeddings into the language model representation space. Experimental results demonstrate that xRAG achieves an average improvement of over 10% across six knowledge-intensive tasks (a one-token projection sketch follows this list).
arXiv Detail & Related papers (2024-05-22T16:15:17Z)
- Multiscale Latent-Guided Entropy Model for LiDAR Point Cloud Compression [18.897023700334458]
The non-uniform distribution and extremely sparse nature of the LiDAR point cloud (LPC) pose significant challenges for its efficient compression. This paper proposes a novel end-to-end, fully-factorized deep framework that encodes the original LPC into an octree structure and hierarchically decomposes the octree entropy model in layers.
arXiv Detail & Related papers (2022-09-26T08:36:11Z)
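The Naive GrepRAG summary above is concrete enough to sketch: the model proposes a ripgrep invocation, the matches become retrieval context, and completion proceeds with that context prepended. In the sketch below, the `llm` callable, the prompt wording, and the `run_ripgrep` validation step are hypothetical; only the `rg` flags mentioned in the comment (`-n`, `-C`) are standard ripgrep.

```python
# Hypothetical sketch of grep-like, index-free retrieval for code
# completion, in the spirit of the Naive GrepRAG summary.
import shlex
import subprocess

ALLOWED = {"rg"}  # only permit ripgrep; never run arbitrary model output

def run_ripgrep(command: str, repo_root: str, timeout: float = 5.0) -> str:
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        raise ValueError(f"refusing to run: {command!r}")
    # models typically emit flags like -n (line numbers) and -C (context lines)
    result = subprocess.run(argv, cwd=repo_root, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout[:4000]  # truncate to fit the prompt budget

def complete_with_greprag(llm, repo_root: str, file_prefix: str) -> str:
    # Step 1: ask the model for a single search command
    cmd = llm(f"Suggest one ripgrep command to find code relevant to "
              f"completing:\n{file_prefix}\nAnswer with the command only.")
    # Step 2: retrieve lexical matches, no index required
    context = run_ripgrep(cmd.strip(), repo_root)
    # Step 3: complete with the retrieved context prepended
    return llm(f"Relevant repository context:\n{context}\n\n"
               f"Complete the following code:\n{file_prefix}")
```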
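The CoRAG entry describes a retrieve-reason-reformulate loop. The control flow can be illustrated as below; `llm` and `retrieve` are hypothetical stand-ins and the prompts are invented, so this is a shape of the idea rather than the paper's training recipe (which concerns how such models are trained, not just run).

```python
# Loop-style sketch of chain-of-retrieval generation, in the spirit of
# the CoRAG summary: retrieve, summarize evidence, reformulate the query,
# repeat, then answer. All prompts and helpers are illustrative.
def corag_answer(llm, retrieve, question: str, max_steps: int = 3) -> str:
    query, notes = question, []
    for _ in range(max_steps):
        passages = retrieve(query)
        notes.append(llm(f"Question: {question}\nEvidence: {passages}\n"
                         f"Summarize what this evidence establishes."))
        # dynamically reformulate the query based on the evolving state
        query = llm(f"Question: {question}\nKnown so far: {' '.join(notes)}\n"
                    f"Write the next search query, or 'DONE' if sufficient.")
        if query.strip() == "DONE":
            break
    return llm(f"Question: {question}\nNotes: {' '.join(notes)}\nFinal answer:")
```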
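The xRAG entry also lends itself to a sketch: a dense retriever's document embedding is mapped by a small projector into the language model's token-embedding space and spliced in as a single pseudo-token. The `Projector` architecture, dimensions, and splicing point below are assumptions consistent with the summary, not the released implementation.

```python
# Sketch of xRAG-style extreme compression: one retrieval embedding is
# projected into the LM's embedding space and prepended as a single
# pseudo-token. Dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps a retriever embedding (d_ret) into the LM hidden space (d_lm)."""
    def __init__(self, d_ret: int = 768, d_lm: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_ret, d_lm), nn.GELU(),
                                 nn.Linear(d_lm, d_lm))

    def forward(self, doc_emb: torch.Tensor) -> torch.Tensor:
        return self.net(doc_emb)

def splice_document_token(tok_embs: torch.Tensor, doc_emb: torch.Tensor,
                          projector: Projector) -> torch.Tensor:
    """Prepend the projected document vector as one extra 'token' embedding."""
    doc_tok = projector(doc_emb).unsqueeze(1)      # (B, 1, d_lm)
    return torch.cat([doc_tok, tok_embs], dim=1)   # (B, 1 + T, d_lm)

projector = Projector()
prompt_embs = torch.randn(2, 16, 1024)   # embeddings of the tokenized prompt
doc_embs = torch.randn(2, 768)           # dense-retriever outputs
inputs_embeds = splice_document_token(prompt_embs, doc_embs, projector)
# feed inputs_embeds to a decoder that accepts embedding inputs, so the
# whole retrieved document costs exactly one token of context
print(inputs_embeds.shape)  # torch.Size([2, 17, 1024])
```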
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.