LightVLM: Accelerating Large Multimodal Models with Pyramid Token Merging and KV Cache Compression
- URL: http://arxiv.org/abs/2509.00419v1
- Date: Sat, 30 Aug 2025 08:57:53 GMT
- Title: LightVLM: Accelerating Large Multimodal Models with Pyramid Token Merging and KV Cache Compression
- Authors: Lianyu Hu, Fanhua Shang, Wei Feng, Liang Wan
- Abstract summary: We introduce LightVLM, a simple but effective method to greatly accelerate the inference process in a training-free manner. We divide the inference procedure of Vision-Language Models into two stages, i.e., encoding and decoding. Experimental results show that LightVLM retains 100% performance when preserving only 35% of image tokens.
- Score: 33.64434029721401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce LightVLM, a simple but effective method that can be seamlessly deployed upon existing Vision-Language Models (VLMs) to greatly accelerate inference in a training-free manner. We divide the inference procedure of VLMs into two stages, i.e., encoding and decoding, and propose to accelerate both stages simultaneously to largely improve model efficiency. During encoding, we propose pyramid token merging, which reduces tokens across different LLM layers in a hierarchical manner, ultimately keeping only a few dominant tokens to achieve high efficiency. During decoding, aimed at reducing the high latency of outputting long sequences, we propose KV cache compression to remove unnecessary cache entries and increase network throughput. Experimental results show that LightVLM retains 100% of the original performance when preserving only 35% of image tokens, and maintains around 98% performance when keeping only 3% of image tokens. LightVLM increases network throughput by 2.02$\times$ and reduces prefilling time by 3.65$\times$. LightVLM also makes large VLMs fast again by enabling a heavy model (e.g., InternVL2.5 26B) to infer faster than significantly smaller models (e.g., InternVL2.5 8B), facilitating real-world deployment. When generating long text sequences (e.g., 4096 tokens), LightVLM reduces inference time by 3.21$\times$, largely outperforming existing methods.
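The abstract describes the two mechanisms only at a high level. Below is a minimal PyTorch-style sketch of how hierarchical token keeping and attention-based KV-cache pruning might look; the scoring rule, the pruning schedule, and the helper names (`dominant_token_mask`, `pyramid_keep_ratios`, `compress_kv`) are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: the scoring rule, the pruning schedule and all
# helper names are assumptions, not LightVLM's actual implementation.
import torch


def dominant_token_mask(attn: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Score each key token by the attention it receives (averaged over heads
    and queries) and keep the top `keep_ratio` fraction.
    attn: (batch, heads, num_queries, num_keys) attention weights of one layer."""
    scores = attn.mean(dim=(1, 2))                       # (batch, num_keys)
    k = max(1, int(keep_ratio * scores.shape[-1]))
    top = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(1, top, True)
    return mask                                          # True = token survives


def pyramid_keep_ratios(num_layers: int, final_ratio: float = 0.35) -> dict:
    """Hierarchical schedule: prune at a few layers so that only `final_ratio`
    of the image tokens remain after the last pruning stage (assumed schedule)."""
    stages = [num_layers // 4, num_layers // 2, 3 * num_layers // 4]
    per_stage = final_ratio ** (1.0 / len(stages))       # equal ratio per stage
    return {layer: per_stage for layer in stages}


def compress_kv(keys: torch.Tensor, values: torch.Tensor,
                attn: torch.Tensor, keep_ratio: float):
    """Drop cached key/value entries that receive little attention, shrinking
    the KV cache before long decoding runs (batch size 1 for brevity)."""
    mask = dominant_token_mask(attn, keep_ratio)         # (1, num_keys)
    idx = mask[0].nonzero(as_tuple=True)[0]
    return keys[:, :, idx, :], values[:, :, idx, :]      # (1, heads, kept, dim)
```

In this reading, `pyramid_keep_ratios` would decide at which LLM layers image tokens are pruned during encoding, while `compress_kv` would be applied to the accumulated cache before generating long outputs during decoding.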
Related papers
- PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective [59.24570811503256]
We propose PIO-FVLM to reduce redundant visual tokens in vision-language models (VLMs) and accelerate inference. The proposed PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance.
arXiv Detail & Related papers (2026-02-04T15:33:10Z) - InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models [49.08289742711585]
We propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. We show that InfiniteVL achieves over 3.6$\times$ inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache.
arXiv Detail & Related papers (2025-12-09T17:18:32Z) - SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead [56.74139420555097]
We propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger.
arXiv Detail & Related papers (2025-11-30T14:10:28Z) - Attention Is All You Need for KV Cache in Diffusion LLMs [36.94369617373333]
Elastic-Cache performs adaptive, layer-aware cache updates for diffusion large language models. Our method achieves significantly higher throughput (6.8$\times$ on GSM8K) than existing confidence-based approaches.
arXiv Detail & Related papers (2025-10-16T17:59:48Z) - Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs [57.69190972274813]
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to autoregressive models. Existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs.
arXiv Detail & Related papers (2025-07-24T16:51:33Z) - Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z) - FastVLM: Efficient Vision Encoding for Vision Language Models [22.41836943083826]
We introduce FastVLM, a model that achieves an optimized trade-off between latency, model size and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens. In the LLaVA-1.5 setup, FastVLM achieves a 3.2$\times$ improvement in time-to-first-token.
arXiv Detail & Related papers (2024-12-17T20:09:55Z) - iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models [24.0346607116299]
We introduce iLLaVA, a simple method that can be seamlessly deployed upon current Large Vision-Language Models (LVLMs). iLLaVA achieves this by finding and gradually merging the redundant tokens with an accurate and fast algorithm. On tasks across different domains including single-image, multi-image and video understanding, iLLaVA demonstrates strong generalizability with consistently promising efficiency.
arXiv Detail & Related papers (2024-12-09T07:22:19Z) - VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration [7.463830743649754]
Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks.
Key-Value (KV) cache encodes long visual contexts, such as images or videos.
Existing KV cache compression methods are effective for Large Language Models (LLMs).
We propose a novel KV cache compression recipe tailored for accelerating VLM inference.
arXiv Detail & Related papers (2024-10-29T20:04:34Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding [15.723047976314751]
Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following.
We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding.
arXiv Detail & Related papers (2024-02-26T18:59:28Z) - MiniVLM: A Smaller and Faster Vision-Language Model [76.35880443015493]
MiniVLM consists of two modules, a vision feature extractor and a vision-language fusion module.
MiniVLM reduces the model size by $73\%$ and the inference time cost by $94\%$.
arXiv Detail & Related papers (2020-12-13T03:02:06Z)