Efficient Context Scaling with LongCat ZigZag Attention
- URL: http://arxiv.org/abs/2512.23966v2
- Date: Tue, 06 Jan 2026 14:12:55 GMT
- Title: Efficient Context Scaling with LongCat ZigZag Attention
- Authors: Chen Zhang, Yang Bai, Jiahuan Li, Anchun Gui, Keheng Wang, Feifan Liu, Guanyu Wu, Yuwei Jiang, Defei Bu, Li Wei, Haihang Jing, Hongyin Tang, Xin Chen, Xiangzhou Huang, Fengcun Li, Rongxiang Weng, Yulei Qian, Yifan Lu, Yerui Sun, Jingang Wang, Yuchen Xie, Xunliang Cai
- Abstract summary: LongCat ZigZag Attention (LoZA) is a sparse attention scheme designed to transform any existing full-attention model into a sparse version with a rather limited compute budget. LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases.
- Score: 39.95366576062524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce LongCat ZigZag Attention (LoZA), a sparse attention scheme designed to transform any existing full-attention model into a sparse version with a rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.
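The abstract describes LoZA only at this high level and does not spell out the sparsity pattern itself. As a rough illustration of the general idea of replacing full causal attention with a masked, sparse variant, the following is a minimal PyTorch sketch assuming a sliding local window plus periodic strided "anchor" keys; the pattern, window size, and stride are illustrative assumptions rather than the actual LoZA design.

```python
# Illustrative sketch only: the LoZA abstract does not specify the exact sparsity
# pattern, so the local-window-plus-strided-anchor mask below (and the window /
# stride sizes) are assumptions chosen to show the general idea of swapping full
# causal attention for a sparse mask.
import torch
import torch.nn.functional as F


def sparse_causal_mask(seq_len: int, window: int = 128, stride: int = 256) -> torch.Tensor:
    """Boolean mask: True where attention is allowed (causal + sparse)."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions, shape (T, 1)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions, shape (1, T)
    causal = k <= q                          # never attend to future tokens
    local = (q - k) < window                 # recent tokens within a sliding window
    strided = (k % stride) == 0              # periodic "anchor" keys kept for all queries
    return causal & (local | strided)


def sparse_attention(q, k, v, window=128, stride=256):
    """Standard scaled dot-product attention with the sparse mask applied."""
    seq_len = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    mask = sparse_causal_mask(seq_len, window, stride).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    B, H, T, D = 1, 2, 1024, 64
    q = torch.randn(B, H, T, D)
    k = torch.randn(B, H, T, D)
    v = torch.randn(B, H, T, D)
    out = sparse_attention(q, k, v)
    density = sparse_causal_mask(T).float().mean().item()
    print(out.shape, f"mask density: {density:.3f}")  # well below the ~0.5 of full causal attention
```

Under these assumptions, each query scores roughly window + seq_len/stride keys instead of all preceding keys, which is the kind of reduction that would translate into prefill and decode speed-ups at million-token contexts.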
Related papers
- LongCat-Flash Technical Report [165.64670448930875]
LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model.
It is designed for both computational efficiency and advanced agentic capabilities.
We complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens.
arXiv Detail & Related papers (2025-09-01T10:05:45Z)
- SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences [11.225649178057695]
SpecExtend improves speculative decoding on long sequences without additional training.
To improve both draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval.
SpecExtend accelerates speculative decoding by up to 2.84x on 16K-token long summarization and up to 3.86x on long reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:00Z)
- From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models [54.44375226381814]
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling.
We introduce an efficient training recipe for building ultra-long context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens.
Our approach achieves state-of-the-art performance across a diverse set of long-context benchmarks.
arXiv Detail & Related papers (2025-04-08T16:58:58Z)
- LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification [42.54363549922909]
LongSpec is a framework that addresses the challenges of efficient inference over long contexts.
LongSpec achieves up to a 3.26x speedup over strong Flash Attention baselines.
The code is available at https://github.com/sail-sg/LongSpec.
arXiv Detail & Related papers (2025-02-24T18:53:31Z)
- LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention [26.54297116028556]
Large language models (LLMs) have shown remarkable potential in processing long sequences and complex reasoning tasks.
We introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention.
On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM.
arXiv Detail & Related papers (2025-02-20T18:59:52Z)
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding [12.74265334789358]
We show that speculative decoding can achieve speedup even in a high-throughput inference regime for moderate to long sequences.
We propose a theoretical model to select the optimal drafting strategy for maximum speedup.
For moderate to long sequences, we demonstrate up to 2.51x speedup for Llama3.1-8B when serving batch sizes ranging from 32 to 256.
arXiv Detail & Related papers (2024-08-20T17:57:31Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query (a simplified sketch of this selection step appears after this list).
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models [67.58275666573496]
LongLoRA is an efficient fine-tuning approach that extends the context sizes of pre-trained large language models.
We demonstrate strong empirical results on various tasks on Llama2 models from 7B/13B to 70B.
arXiv Detail & Related papers (2023-09-21T17:59:11Z)
- Augmenting Language Models with Long-Term Memory [142.04940250657637]
Existing large language models (LLMs) can only afford fixed-size inputs due to the input length limit.
We propose a framework, Language Models Augmented with Long-Term Memory (LongMem), which enables LLMs to memorize long history.
arXiv Detail & Related papers (2023-06-12T15:13:39Z)
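To make the SPARSEK description above concrete, here is a minimal PyTorch sketch of keeping a constant number of key/value pairs per query with a learned scoring network. It uses a plain hard top-k rather than the paper's differentiable SPARSEK operator, and the class name, shapes, and ranking scheme are assumptions for illustration only.

```python
# Simplified sketch of the SPARSEK idea: a scoring network helps rank key/value
# pairs and only a constant number of them is kept per query. This version uses
# a plain hard top-k (the paper's differentiable SPARSEK operator is omitted);
# class name, shapes, and the ranking scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKSparseAttention(nn.Module):
    def __init__(self, dim: int, k: int = 64):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(dim, 1)  # scoring network over key states

    def forward(self, q, k, v):
        # q, k, v: (batch, seq, dim)
        attn_scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (B, Tq, Tk)
        key_scores = self.scorer(k).squeeze(-1).unsqueeze(1)          # (B, 1, Tk), shared across queries
        ranking = attn_scores + key_scores                            # query-aware ranking of KV pairs

        # Keep only the top-k key/value pairs per query.
        topk = min(self.k, ranking.shape[-1])
        idx = ranking.topk(topk, dim=-1).indices
        keep = torch.zeros_like(ranking).scatter(-1, idx, 1.0).bool()

        attn = F.softmax(attn_scores.masked_fill(~keep, float("-inf")), dim=-1)
        return attn @ v


if __name__ == "__main__":
    layer = TopKSparseAttention(dim=64, k=8)
    x = torch.randn(2, 256, 64)
    print(layer(x, x, x).shape)  # torch.Size([2, 256, 64])
```

Because each query attends to at most k key/value pairs, the per-query cost stays constant as the sequence grows, which is the property the SPARSEK entry highlights.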
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.