Latent Context Compilation: Distilling Long Context into Compact Portable Memory
- URL: http://arxiv.org/abs/2602.21221v1
- Date: Sat, 31 Jan 2026 08:38:07 GMT
- Title: Latent Context Compilation: Distilling Long Context into Compact Portable Memory
- Authors: Zeju Li, Yizhou Zhou, Qiang Xu
- Abstract summary: We propose Latent Context Compilation, a framework that shifts context processing from adaptation to compilation. By utilizing a disposable LoRA module as a compiler, we distill long contexts into compact buffer tokens. Experiments with Llama-3.1-8B demonstrate that Latent Context Compilation preserves fine-grained details and reasoning capabilities.
- Score: 13.768393657432027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient long-context LLM deployment is stalled by a dichotomy between amortized compression, which struggles with out-of-distribution generalization, and Test-Time Training, which incurs prohibitive synthetic data costs and requires modifying model weights, creating stateful parameters that complicate concurrent serving. We propose Latent Context Compilation, a framework that fundamentally shifts context processing from adaptation to compilation. By utilizing a disposable LoRA module as a compiler, we distill long contexts into compact buffer tokens -- stateless, portable memory artifacts that are plug-and-play compatible with frozen base models. Crucially, we introduce a self-aligned optimization strategy that eliminates the need for synthetic context-relevant QA pairs. By regularizing the context reconstruction task with context-agnostic random queries, we force compressed tokens to reside within the model's existing instruction-following manifold. Experiments with Llama-3.1-8B demonstrate that Latent Context Compilation preserves fine-grained details and reasoning capabilities where prior methods falter, effectively decoupling memory density from model parameters even at a 16x compression ratio.
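The core idea in the abstract, distilling a long context into a far smaller set of buffer-token embeddings that a frozen model can read, can be illustrated with a minimal numpy sketch. This is a hypothetical stand-in, not the paper's implementation: a fixed random linear read-out plays the role of the frozen base model, and the buffer tokens are fit in closed form by least squares rather than through the LoRA-based compiler. The 16x ratio mirrors the one quoted in the abstract.

```python
import numpy as np

# Toy sketch (hypothetical, not the paper's code): distill a "context" of
# 64 token embeddings into 4 buffer tokens (16x compression) so that a
# frozen linear read-out can approximately reconstruct the context.
rng = np.random.default_rng(0)
d, n_ctx, n_buf = 8, 64, 4                 # embedding dim, 64 tokens -> 4 buffers

context = rng.normal(size=(n_ctx, d))      # stand-in for the long context
readout = rng.normal(size=(n_ctx, n_buf))  # stand-in for the frozen model's reads

# Fit buffer tokens so that readout @ buffer_tokens ~= context
# (closed-form least squares replaces the paper's LoRA optimization).
buffer_tokens, *_ = np.linalg.lstsq(readout, context, rcond=None)

recon_err = np.mean((readout @ buffer_tokens - context) ** 2)
baseline = np.mean(context ** 2)           # error of an empty (zero) buffer
ratio = n_ctx / n_buf

print(f"compression ratio: {ratio:.0f}x")
print(f"reconstruction MSE: {recon_err:.3f} (vs {baseline:.3f} with no buffer)")
```

The point of the sketch is only the shape of the problem: the compact artifact (`buffer_tokens`) is stateless and portable, while the read-out stays frozen; everything learned lives in the buffer, not in model weights.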
Related papers
- Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning [47.87361916374891]
We propose a framework for efficient long-context inference based on chunk-wise compression and selective memory recall. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. It achieves up to a 2x reduction in peak GPU memory usage and a 6x inference speedup over MemAgent.
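The chunk-then-recall pipeline described above can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's system: mean pooling stands in for the learned compressor, and cosine similarity against a query stands in for the selective recall policy.

```python
import numpy as np

# Hypothetical sketch: segment a long input into chunks, compress each chunk
# into one memory vector, then recall only the best-matching chunks.
rng = np.random.default_rng(1)
d, chunk_len, n_chunks = 16, 32, 8

chunks = [rng.normal(size=(chunk_len, d)) for _ in range(n_chunks)]

# 1) Compress: each 32-token chunk becomes a single memory vector.
memory = np.stack([c.mean(axis=0) for c in chunks])      # (n_chunks, d)

# 2) Selective recall: score memories against a query, keep the top-k chunks.
query = chunks[5].mean(axis=0)            # a query aligned with chunk 5
scores = memory @ query / (np.linalg.norm(memory, axis=1) * np.linalg.norm(query))
top_k = np.argsort(scores)[::-1][:2]      # recall the 2 best-matching chunks

print("recalled chunk ids:", top_k.tolist())
```

Only the recalled chunks would be expanded for attention, which is where the memory and speedup gains come from: the rest of the context stays in its compressed form.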
arXiv Detail & Related papers (2026-02-09T08:33:11Z)
- SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z) - CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning [34.38636514331703]
CLaRa is a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. Experiments show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
arXiv Detail & Related papers (2025-11-24T00:11:14Z) - CompLLM: Compression for Long Context Q&A [47.90063873976842]
We introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%.
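The practical consequence of compressing segments independently is that compressed segments become cacheable and reusable across prompts that share them. A minimal hypothetical sketch (adjacent-pair averaging stands in for the learned 2x soft compressor):

```python
import numpy as np

# Hypothetical sketch: independent per-segment compression means a segment
# shared by two prompts is compressed once and served from a cache, which
# is where the TTFT savings at high context lengths come from.
rng = np.random.default_rng(2)
cache: dict[bytes, np.ndarray] = {}
calls = 0

def compress_segment(seg: np.ndarray) -> np.ndarray:
    """2x compression stand-in: average adjacent token embeddings."""
    global calls
    key = seg.tobytes()
    if key not in cache:
        calls += 1
        cache[key] = seg.reshape(-1, 2, seg.shape[1]).mean(axis=1)
    return cache[key]

shared = rng.normal(size=(8, 4))          # a segment shared by two prompts
unique = rng.normal(size=(8, 4))

compressed_a = [compress_segment(s) for s in [shared, unique]]
compressed_b = [compress_segment(s) for s in [shared]]  # cache hit

print("compressor invocations:", calls)   # 2 segments compressed, not 3
```

A holistic compressor could not do this, because any change anywhere in the context would invalidate the whole compressed representation.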
arXiv Detail & Related papers (2025-09-23T16:49:43Z) - CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling [52.05149789178508]
CCF is a novel context compression framework designed to enable efficient long-context modeling. CCF integrates segment-wise semantic aggregation with key-value memory encoding, forming compact representations. Empirical results on multiple long-context language modeling benchmarks demonstrate that CCF achieves competitive perplexity under high compression ratios.
arXiv Detail & Related papers (2025-09-11T07:13:49Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
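The low-rank, retraining-free idea described above can be illustrated with a truncated SVD of a projection weight: one d x d matmul is replaced by two skinny ones. This is a hypothetical sketch with illustrative shapes and rank, not the paper's configuration.

```python
import numpy as np

# Hypothetical sketch of low-rank KV-weight compression: factor a stand-in
# key projection matrix with a truncated SVD and plug the two skinny
# factors in without any model retraining.
rng = np.random.default_rng(3)
d, rank = 64, 8

W_k = rng.normal(size=(d, d))             # stand-in key projection weight
U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
A = U[:, :rank] * S[:rank]                # (d, rank)
B = Vt[:rank]                             # (rank, d)

x = rng.normal(size=(d,))
full = W_k @ x
approx = A @ (B @ x)                      # same map through a rank-8 bottleneck

params_full = W_k.size                    # 4096 floats
params_lr = A.size + B.size               # 1024 floats
rel_err = np.linalg.norm(full - approx) / np.linalg.norm(full)
print(f"params: {params_full} -> {params_lr}, relative error: {rel_err:.2f}")
```

For a random matrix the truncation error at rank 8 of 64 is large; real projection weights are far closer to low-rank, which is what makes this kind of plug-in compression viable in practice.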
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Recurrent Context Compression: Efficiently Expanding the Context Window of LLM [22.595457889113668]
This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of Transformer-based large language models (LLMs).
We validated our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M.
arXiv Detail & Related papers (2024-06-10T08:50:59Z) - Compressed Context Memory For Online Language Model Interaction [39.72054168889216]
This paper presents a context key/value compression method for Transformer language models in online scenarios.
As the context lengthens, the attention process demands increasing memory and computations, which in turn reduces the throughput of the language model.
We propose a compressed context memory system that continually compresses the accumulating attention key/value pairs into a compact memory space.
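The online setting described above, where attention key/value pairs keep arriving but the memory footprint must stay fixed, can be sketched with a toy merge rule. This is hypothetical: an exponential moving average into a fixed number of slots stands in for the learned compression module.

```python
import numpy as np

# Hypothetical sketch: keep the attention KV memory at a fixed size by
# folding each newly arriving key/value pair into one of m slots.
rng = np.random.default_rng(4)
d, m, alpha = 8, 4, 0.5                   # head dim, memory slots, merge rate

mem_k = np.zeros((m, d))
mem_v = np.zeros((m, d))

n_steps = 100
for t in range(n_steps):                  # online interaction: one pair per turn
    k, v = rng.normal(size=(2, d))
    slot = t % m                          # fixed slot routing for the sketch
    mem_k[slot] = (1 - alpha) * mem_k[slot] + alpha * k
    mem_v[slot] = (1 - alpha) * mem_v[slot] + alpha * v

# The footprint stays at m slots no matter how long the dialogue runs.
print("pairs seen:", n_steps, "| memory slots:", mem_k.shape[0])
```

The contrast with a plain KV cache is the point: a plain cache grows linearly with the interaction, while the compressed memory here is bounded at `m` slots, trading some fidelity for constant memory and throughput.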
arXiv Detail & Related papers (2023-12-06T10:50:43Z) - In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.