InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
- URL: http://arxiv.org/abs/2512.08829v1
- Date: Tue, 09 Dec 2025 17:18:32 GMT
- Title: InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
- Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
- Abstract summary: We propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. We show that InfiniteVL achieves over a 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving its long-term memory cache.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache of Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when the sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over a 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving its long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
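To make the hybrid design concrete, here is a minimal PyTorch sketch of a stack that interleaves constant-state gated delta-rule layers with fixed-window softmax attention. The layer ratio, dimensions, and the simplified recurrence are illustrative assumptions, not the paper's implementation:

```python
# Sketch of a hybrid block stack in the spirit of InfiniteVL: most layers
# use a gated delta-rule recurrence (constant-size state), interleaved with
# sliding window attention (bounded KV cache). All hyperparameters and the
# simplified update below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDeltaLayer(nn.Module):
    """Simplified gated delta rule: S <- a*S + beta*(v - S k) k^T."""
    def __init__(self, d):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)
        self.gate = nn.Linear(d, 2)   # per-token decay a and write strength beta
        self.out = nn.Linear(d, d)

    def forward(self, x):                                  # x: (B, T, d)
        B, T, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        k = F.normalize(k, dim=-1)
        a, beta = torch.sigmoid(self.gate(x)).unbind(-1)   # (B, T) each
        S = x.new_zeros(B, d, d)                           # constant-size memory state
        ys = []
        for t in range(T):                                 # recurrent form: O(T) time, O(1) state
            kt, vt = k[:, t], v[:, t]
            pred = torch.einsum('bij,bj->bi', S, kt)       # current prediction S k
            S = a[:, t, None, None] * S + beta[:, t, None, None] * \
                torch.einsum('bi,bj->bij', vt - pred, kt)  # delta-rule write
            ys.append(torch.einsum('bij,bj->bi', S, q[:, t]))
        return x + self.out(torch.stack(ys, dim=1))

class SlidingWindowLayer(nn.Module):
    """Softmax attention restricted to a fixed causal window."""
    def __init__(self, d, window=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.window = window

    def forward(self, x):
        T = x.size(1)
        i = torch.arange(T, device=x.device)
        # disallow future tokens and anything older than `window` -> bounded KV cache
        mask = (i[None, :] > i[:, None]) | (i[:, None] - i[None, :] >= self.window)
        return x + self.attn(x, x, x, attn_mask=mask)[0]

class HybridStack(nn.Module):
    """Every fourth layer is SWA; the rest are linear-complexity delta layers."""
    def __init__(self, d=256, layers=8, swa_every=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            SlidingWindowLayer(d) if (i + 1) % swa_every == 0 else GatedDeltaLayer(d)
            for i in range(layers)
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x
```

Because the delta layers carry a fixed-size state and the SWA layers cache at most `window` keys, per-token cost and memory stay constant as the sequence grows, which is the property the abstract's constant-latency claim rests on.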
Related papers
- Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing [16.063514680699576]
Multimodal large language models (MLLMs) extend visual understanding through a three-stage pipeline. Multimodal preprocessing, especially video decoding, often dominates Time-to-First-Token (TTFT). We present FlashCodec and UnifiedServe, two complementary designs that jointly optimize the end-to-end MLLM pipeline.
arXiv Detail & Related papers (2025-12-19T13:40:13Z) - LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs [52.24096832965001]
We present LLaVA-UHD v3, an MLLM centered on our proposed Progressive Visual Compression (PVC) method. PVC can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. Building upon ViT-UHD, LLaVA-UHD v3 achieves performance competitive with Qwen2-VL while further reducing TTFT by 1.9×.
arXiv Detail & Related papers (2025-11-26T08:11:10Z) - FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning [16.753299634529736]
Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to easing the computational and memory burden while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing-analysis perspective.
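The abstract does not spell out the pruning criterion, so the sketch below shows one plausible routing-based rule, keeping the visual tokens the MoE router is most decisive about, purely as an assumption about what "routing-aware token pruning" could look like:

```python
# Hedged sketch of routing-aware visual-token pruning in the spirit of
# FastMMoE. The score (max router probability per token) is an assumed
# criterion, not the paper's: visual tokens with indecisive routing are
# treated as redundant and dropped; text tokens are always kept.
import torch

def route_and_prune(router_logits, visual_mask, keep_ratio=0.5):
    """
    router_logits: (B, T, E) gate logits from an MoE layer
    visual_mask:   (B, T) True at visual-token positions
    Returns a (B, T) boolean keep mask.
    """
    gate = router_logits.softmax(-1)
    score = gate.max(-1).values                          # (B, T) routing decisiveness
    n_vis = int(visual_mask.sum(-1).min())               # simplification: uniform count per row
    k = max(1, int(n_vis * keep_ratio))
    vis_score = score.masked_fill(~visual_mask, float('-inf'))
    thresh = vis_score.topk(k, dim=-1).values[:, -1:]    # k-th best visual score per row
    return ~visual_mask | (vis_score >= thresh)
```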
arXiv Detail & Related papers (2025-11-22T02:25:00Z) - SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference [49.84148668264725]
We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0× faster prefilling, 2.5× faster decoding, and an overall 2.6× end-to-end speedup on long-context video tasks.
arXiv Detail & Related papers (2025-10-20T17:35:47Z) - SpecVLM: Fast Speculative Decoding in Vision-Language Models [14.243294546325714]
Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs). We study speculative decoding for vision-language models (VLMs) and introduce SpecVLM, a practical system that delivers 1.5-2.3× end-to-end speedups over full autoregressive inference.
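The abstract does not detail SpecVLM's draft design, but the draft-and-verify loop common to speculative decoding systems looks roughly like the sketch below; `draft_model` and `target_model` are hypothetical callables returning (B, T, V) logits, and greedy verification stands in for the full rejection-sampling scheme:

```python
# Generic speculative decoding step: draft k tokens with a cheap model,
# verify them with one forward pass of the target model, and accept the
# longest agreeing prefix. A sketch under stated assumptions, not SpecVLM.
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, tokens, k=4):
    draft = tokens
    for _ in range(k):                                   # cheap autoregressive drafting
        logits = draft_model(draft)[:, -1]
        draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=1)
    proposed = draft[:, tokens.size(1):]                 # (B, k) drafted tokens
    # one target pass scores every proposal position in parallel
    t_logits = target_model(draft)[:, tokens.size(1) - 1 : -1]
    t_pred = t_logits.argmax(-1)                         # greedy verification
    agree = (t_pred == proposed).long().cumprod(dim=1)   # longest matching prefix
    n_accept = int(agree.sum(dim=1).min())               # conservative across the batch
    accepted = proposed[:, :n_accept]
    if n_accept < proposed.size(1):                      # on mismatch, take the target's token
        accepted = torch.cat([accepted, t_pred[:, n_accept:n_accept + 1]], dim=1)
    return torch.cat([tokens, accepted], dim=1)
```

Each call thus emits between 1 and k+1 tokens for roughly one target-model forward pass, which is where the end-to-end speedup comes from.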
arXiv Detail & Related papers (2025-09-15T11:53:56Z) - Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We also propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
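The confidence-aware strategy as described can be sketched directly: at each denoising step, commit every masked position whose top-1 probability clears a threshold and leave the rest for later steps. The `model` interface and `MASK_ID` below are assumptions:

```python
# Sketch of confidence-aware parallel decoding per the Fast-dLLM abstract.
# `model` is assumed to map (B, T) token ids (with MASK_ID at undecided
# positions) to (B, T, V) logits in a single bidirectional pass.
import torch

MASK_ID = 0  # hypothetical [MASK] token id

@torch.no_grad()
def confident_decode_step(model, tokens, tau=0.9):
    logits = model(tokens)
    probs = logits.softmax(-1)
    conf, pred = probs.max(-1)                   # per-position confidence and argmax
    masked = tokens.eq(MASK_ID)
    commit = masked & (conf >= tau)              # only confident masked slots decode now
    # guarantee progress: if no slot clears tau, commit the single best one
    stuck = masked.any(-1) & ~commit.any(-1)
    if stuck.any():
        best = conf.masked_fill(~masked, -1.0).argmax(-1)
        commit[stuck, best[stuck]] = True
    return torch.where(commit, pred, tokens)
```

Thresholding is what "mitigates dependency violations": low-confidence positions, whose predictions likely depend on still-masked neighbors, are deferred rather than decoded in parallel.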
arXiv Detail & Related papers (2025-05-28T17:39:15Z) - LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding [29.719450799231705]
Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input. Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. We propose Lightweight Video Compression (LVC), a novel method featuring a Query-Attention Video Compression mechanism.
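The abstract only names the Query-Attention Video Compression mechanism; one common realization of that idea is a small set of learned queries that cross-attend to per-frame tokens, compressing each frame to a fixed token budget. The sketch below is that generic pattern, not LVC's implementation:

```python
# Generic learned-query cross-attention compressor: n_queries learned
# vectors attend over each frame's patch tokens, reducing the per-frame
# token count to a fixed budget. All dimensions are illustrative.
import torch
import torch.nn as nn

class QueryAttnCompressor(nn.Module):
    def __init__(self, d=768, n_queries=16, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d) * 0.02)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, frame_tokens):             # (B*F, N, d) tokens per frame
        q = self.queries.unsqueeze(0).expand(frame_tokens.size(0), -1, -1)
        out, _ = self.attn(q, frame_tokens, frame_tokens)
        return out                               # (B*F, n_queries, d) compressed tokens

# usage: 256 patch tokens per frame -> 16 tokens per frame
comp = QueryAttnCompressor()
frames = torch.randn(4 * 8, 256, 768)            # 4 videos x 8 frames
print(comp(frames).shape)                        # torch.Size([32, 16, 768])
```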
arXiv Detail & Related papers (2025-04-09T12:51:10Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long-video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention [26.54297116028556]
Large language models (LLMs) have shown remarkable potential in processing long sequences and complex reasoning tasks. We introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. On average, LServe accelerates LLM prefilling by up to 2.9× and decoding by 1.3-2.1× over vLLM.
arXiv Detail & Related papers (2025-02-20T18:59:52Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by pruning low-attention visual tokens after the early layers.
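The core mechanism, ranking visual tokens by the attention they receive in an early layer and keeping only the top fraction, can be sketched as follows; the hook into a real LVLM and the 50% keep ratio (the title's "1/2 tokens after layer 2") are the only FastV-specific details assumed here:

```python
# FastV-style visual-token pruning sketch: after an early layer, score each
# visual token by the attention it receives, keep the top keep_ratio, and
# splice the sequence back together. A sketch, not the authors' code.
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_len, keep_ratio=0.5):
    """
    hidden: (B, T, d) hidden states after the early layer
    attn:   (B, H, T, T) attention weights from that layer
    Visual tokens occupy positions [vis_start, vis_start + vis_len).
    """
    recv = attn.mean(1).mean(1)                        # (B, T): avg attention received
    vis_scores = recv[:, vis_start:vis_start + vis_len]
    k = max(1, int(vis_len * keep_ratio))
    keep = vis_scores.topk(k, dim=-1).indices.sort(-1).values  # preserve token order
    idx = keep + vis_start                             # back to sequence coordinates
    batch = torch.arange(hidden.size(0), device=hidden.device)[:, None]
    pruned_vis = hidden[batch, idx]                    # (B, k, d) surviving visual tokens
    return torch.cat([hidden[:, :vis_start],           # text before the image
                      pruned_vis,
                      hidden[:, vis_start + vis_len:]], dim=1)
```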
arXiv Detail & Related papers (2024-03-11T14:35:32Z)