QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization
- URL: http://arxiv.org/abs/2505.18092v2
- Date: Tue, 27 May 2025 09:42:25 GMT
- Title: QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization
- Authors: Weizhou Shen, Chenliang Li, Fanqi Wan, Shengyi Liao, Shaopeng Lai, Bo Zhang, Yingcheng Shi, Yuning Wu, Gang Fu, Zhansheng Li, Bin Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
- Abstract summary: QwenLong-CPRS is a context compression framework designed for explicit long-context optimization. QwenLong-CPRS achieves 21.59$\times$ context compression alongside 19.15-point average performance gains.
- Score: 70.3105638352827
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) Natural language-guided dynamic optimization, (2) Bidirectional reasoning layers for enhanced boundary awareness, (3) Token critic mechanisms with language modeling heads, and (4) Window-parallel inference. Comprehensive evaluations across five benchmarks (4K-2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) Consistent superiority over other context management methods like RAG and sparse attention in both accuracy and efficiency. (2) Architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieves 21.59$\times$ context compression alongside 19.15-point average performance gains; (3) Deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, establishing new SOTA performance.
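The abstract describes instruction-guided, token-level context compression combined with window-parallel inference. The following is a minimal, hypothetical Python sketch of that general idea only: split the long context into windows, score each window's tokens against the instruction, and keep the highest-scoring tokens under a budget. The `score_tokens` scorer, function names, and parameter values are assumptions for illustration, not the released QwenLong-CPRS implementation.

```python
# Hypothetical sketch of instruction-guided, window-parallel context compression.
# `score_tokens` stands in for a token-critic model and is an assumption.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def compress_context(
    context_tokens: List[str],
    instruction: str,
    score_tokens: Callable[[str, List[str]], List[float]],
    window_size: int = 4096,
    keep_ratio: float = 0.05,  # ~20x compression, roughly in the range reported
) -> List[str]:
    """Keep only the tokens most relevant to `instruction`."""
    # 1. Split the long context into fixed-size windows.
    windows = [
        context_tokens[i : i + window_size]
        for i in range(0, len(context_tokens), window_size)
    ]

    # 2. Score every window independently (window-parallel); one relevance
    #    score per token, conditioned on the natural language instruction.
    with ThreadPoolExecutor() as pool:
        window_scores = list(pool.map(lambda w: score_tokens(instruction, w), windows))

    # 3. Flatten (position, score) pairs and select the top-scoring tokens.
    scored: List[Tuple[int, float]] = []
    offset = 0
    for window, scores in zip(windows, window_scores):
        scored.extend((offset + j, s) for j, s in enumerate(scores))
        offset += len(window)

    budget = max(1, int(len(context_tokens) * keep_ratio))
    keep = sorted(sorted(scored, key=lambda x: x[1], reverse=True)[:budget])

    # 4. Return the kept tokens in their original order.
    return [context_tokens[i] for i, _ in keep]
```

A usage example would pass the tokenized document, the user's query or instruction, and any per-token relevance scorer; the compressed token list can then be handed to a downstream LLM in place of the full context.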
Related papers
- H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages [0.6629765271909505]
H-NET++ is a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results.
arXiv Detail & Related papers (2025-08-07T17:59:01Z) - LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning [38.679497621876926]
LIFT is a large language model (LLM)-based coding assistant for HLS that automatically generates performance-critical pragmas. We fine-tune the LLM by tightly integrating and supervising the training process with a graph neural network (GNN).
arXiv Detail & Related papers (2025-04-29T21:42:59Z) - Learning Adaptive Parallel Reasoning with Language Models [70.1745752819628]
We propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures.
arXiv Detail & Related papers (2025-04-21T22:29:02Z) - ASMA-Tune: Unlocking LLMs' Assembly Code Comprehension via Structural-Semantic Instruction Tuning [33.53059396922164]
Assembly code analysis and comprehension play critical roles in applications like reverse engineering. Traditional masked language modeling approaches do not explicitly focus on natural language interaction. We present Assembly Augmented Tuning, an end-to-end structural-semantic instruction tuning framework.
arXiv Detail & Related papers (2025-03-14T17:36:08Z) - Qwen2.5-1M Technical Report [72.09755998661568]
We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup.
arXiv Detail & Related papers (2025-01-26T03:47:25Z) - Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models [75.58140912100318]
We introduce an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention. We conduct experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV. We introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement up to 52.5%.
arXiv Detail & Related papers (2025-01-23T12:58:14Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities.
This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity.
We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z)