SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs
- URL: http://arxiv.org/abs/2512.00722v1
- Date: Sun, 30 Nov 2025 04:32:43 GMT
- Title: SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs
- Authors: Jiaming Xu, Jiayi Pan, Hanzhen Wang, Yongkang Zhou, Jiancai Ye, Yu Wang, Guohao Dai,
- Abstract summary: We analyze the similarity in information focus between the distilled language model(DLM) and the original LLM from the perspective of information theory.<n>We present SpeContext, an algorithm and system co-design for long-context reasoning.
- Score: 13.231762612368376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we point out that the objective of the retrieval algorithms is to align with the LLM, which is similar to the objective of knowledge distillation in LLMs. We analyze the similarity in information focus between the distilled language model(DLM) and the original LLM from the perspective of information theory, and thus propose a novel paradigm that leverages a DLM as the retrieval algorithm. Based on the insight, we present SpeContext, an algorithm and system co-design for long-context reasoning. (1) At the algorithm level, SpeContext proposes lightweight retrieval head based on the head-level attention weights of DLM, achieving > 90% parameters reduction by pruning the redundancy. (2) At the system level, SpeContext designs an asynchronous prefetch dataflow via the elastic loading strategy, effectively overlapping KV cache retrieval with the LLM computation. (3) At the compilation level, SpeContext constructs the theoretical memory model and implements an adaptive memory management system to achieve acceleration by maximizing GPU memory utilization. We deploy and evaluate SpeContext in two resourceconstrained environments, cloud and edge. Extensive experiments show that, compared with the Huggingface framework, SpeContext achieves up to 24.89x throughput improvement in cloud and 10.06x speedup in edge with negligible accuracy loss, pushing the Pareto frontier of accuracy and throughput.
Related papers
- URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding [55.45331924836242]
We present URaG, a framework that Unifies Retrieval and Generation within a single MLLM.<n>We show that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.
arXiv Detail & Related papers (2025-11-13T17:54:09Z) - SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs [74.2538340966038]
We investigate how Multimodal Large Language Models (MLLMs) process visual inputs by analyzing their attention mechanisms.<n>We reveal a surprising sparsity phenomenon: only a small subset of attention heads in LLMs actively contribute to visual understanding.<n>We introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores.
arXiv Detail & Related papers (2025-06-05T17:59:55Z) - TracLLM: A Generic Framework for Attributing Long Context LLMs [34.802736332993994]
We develop TracLLM, the first generic context traceback framework tailored to long context LLMs.<n>Our framework can improve the effectiveness and efficiency of existing feature attribution methods.<n>Our evaluation results show TracLLM can effectively identify texts in a long context that lead to the output of an LLM.
arXiv Detail & Related papers (2025-06-04T17:48:16Z) - Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources.<n>This paper formulates LLM inference optimization as a multi-stage online scheduling problem.<n>We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z) - Parametric Retrieval Augmented Generation [32.29608109539912]
Parametric RAG is a new RAG paradigm that integrates external knowledge directly into the parameters of feed-forward networks.<n>It substantially enhances both the effectiveness and efficiency of knowledge augmentation in large language models.
arXiv Detail & Related papers (2025-01-27T10:04:49Z) - Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework.<n>This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings.<n>Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z) - Bridging LLMs and KGs without Fine-Tuning: Intermediate Probing Meets Subgraph-Aware Entity Descriptions [49.36683223327633]
Large Language Models (LLMs) encapsulate extensive world knowledge and exhibit powerful context modeling capabilities.<n>We propose a novel framework that synergizes the strengths of LLMs with robust knowledge representation to enable effective and efficient KGC.<n>We achieve a 47% relative improvement over previous methods based on non-fine-tuned LLMs and, to our knowledge, are the first to achieve classification performance comparable to fine-tuned LLMs.
arXiv Detail & Related papers (2024-08-13T10:15:55Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.