Related papers: Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference

Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference

URL: http://arxiv.org/abs/2602.10021v1
Date: Tue, 10 Feb 2026 17:42:31 GMT
Title: Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference
Authors: Wenxuan Xie, Yujia Wang, Xin Tan, Chaochao Lu, Xia Hu, Xuhong Wang,
Abstract summary: DRIFT is a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process.<n>Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query.<n>Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of Large Language Models.
Score: 45.760483245296456
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot-Xie/DRIFT.

Related papers

LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval [74.72139580745511]
LaSER is a novel self-distillation framework that internalizes explicit reasoning into the latent space of retrievers.<n>Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
arXiv Detail & Related papers (2026-03-02T04:11:18Z)
Representation Interventions Enable Lifelong Unstructured Knowledge Control [54.86207134539453]
Large language models (LLMs) often produce incorrect or outdated content. Updating their knowledge efficiently and accurately without costly retraining is a major challenge.<n>We introduce RILKE, a robust and scalable method that treats knowledge control as interventions within the model's representation space.<n>During training, RILKE learns paraphrase-robust and edit-localized modules that limit each update to a low-dimensional subspace to minimize cross-edit interference.<n>In inference, a query-adaptive router selects the appropriate module to guide the model's generation.
arXiv Detail & Related papers (2025-11-25T22:15:00Z)
Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs [15.23174472320989]
Large Language Models (LLMs) are central to many contemporary AI applications.<n>Recent works in eXplainable AI (XAI) suggest that interpretability can also enable model compression.
arXiv Detail & Related papers (2025-06-16T17:38:36Z)
Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation [20.674323995662366]
Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for knowledge injection during large language model (LLM) inference in recent years.<n>Due to their limited ability to exploit fine-grained inter-document relationships, current RAG implementations face challenges in effectively addressing the retrieved noise and redundancy content.<n>We propose an Efficient Dynamic Clustering-based document Compression framework (EDC2-RAG) that utilizes latent inter-document relationships while simultaneously removing irrelevant information and redundant content.
arXiv Detail & Related papers (2025-04-04T04:43:13Z)
Dynamic Parametric Retrieval Augmented Generation for Test-time Knowledge Enhancement [22.386864304549285]
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources and incorporating them into the context.<n>We propose Dynamic Parametric RAG (DyPRAG), a novel framework that leverages a lightweight parameter translator model to efficiently convert documents into parametric knowledge.
arXiv Detail & Related papers (2025-03-31T09:46:35Z)
BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression [91.23933111083389]
Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge.<n>This paper presents BRIEF, a lightweight approach that performs query-aware multi-hop reasoning.<n>Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
arXiv Detail & Related papers (2024-10-20T04:24:16Z)
Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities. In-Context Learning (ICL) and. Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting. LLMs to downstream tasks. We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
AutoRE: Document-Level Relation Extraction with Large Language Models [27.426703757501507]
We introduce AutoRE, an end-to-end DocRE model that adopts a novel RE extraction paradigm named RHF (Relation-Head-Facts) Unlike existing approaches, AutoRE does not rely on the assumption of known relation options, making it more reflective of real-world scenarios. Our experiments on the RE-DocRED dataset showcase AutoRE's best performance, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-03-21T23:48:21Z)
Extending Context Window of Large Language Models via Semantic Compression [21.35020344956721]
Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses. We propose a novel semantic compression method that enables generalization to texts 6-8 times longer, without incurring significant computational costs or requiring fine-tuning.
arXiv Detail & Related papers (2023-12-15T07:04:33Z)
In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots. ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning [27.224364543134094]
We introduce a novel logic-driven data augmentation approach, AMR-LDA.<n> AMR-LDA converts the original text into an Abstract Meaning Representation (AMR) graph.<n>The modified AMR graphs are subsequently converted back into text to create augmented data.
arXiv Detail & Related papers (2023-05-21T23:16:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.