Related papers: ApET: Approximation-Error Guided Token Compression for Efficient VLMs

ApET: Approximation-Error Guided Token Compression for Efficient VLMs

URL: http://arxiv.org/abs/2602.19870v1
Date: Mon, 23 Feb 2026 14:15:37 GMT
Title: ApET: Approximation-Error Guided Token Compression for Efficient VLMs
Authors: Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, Hairong Zheng,
Abstract summary: We present ApET, an Approximation-Error guided Token compression framework.<n>We show that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks.<n>Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference and making VLM deployment more practical.
Score: 16.4657793751671
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically relies on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introduce positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at https://github.com/MaQianKun0/ApET.

Related papers

DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference [14.714791872881397]
DUET-VLM is a versatile plug-and-play dual compression framework.<n>It enables robust adaptation to reduced visual (image/video) input without sacrificing accuracy.<n>Our results highlight end-to-end training with DUET-VLM.
arXiv Detail & Related papers (2026-02-21T14:22:49Z)
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs [82.72388893596555]
Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks.<n>Previous token compression techniques are often constrained by rules that risk discarding critical information.<n>We reformulate token compression as a lightweight plug-and-play framework that reformulates token compression into an end-to-end learnable decision process.
arXiv Detail & Related papers (2025-10-18T17:54:18Z)
Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention [50.97683288777336]
Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens.<n>Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention.<n>We propose HoloV, a plug-and-play visual token pruning framework for efficient inference.
arXiv Detail & Related papers (2025-10-03T11:33:40Z)
Variation-aware Vision Token Dropping for Faster Large Vision-Language Models [24.952668143243542]
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks.<n> Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency.<n>We propose Variation-aware Vision Token Dropping (textiti.e., textbfV$2$Drop), which progressively removes visual tokens with minimal variation during LVLM inference.
arXiv Detail & Related papers (2025-09-01T15:28:44Z)
ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework.<n>It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z)
Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM [41.796933489107815]
We identify and study the computation-level redundancy on vision tokens to ensure no information loss.<n>We propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens.
arXiv Detail & Related papers (2025-05-21T17:59:52Z)
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce textbfTopV, a compatible textbfTOken textbfPruning with inference Time Optimization for fast and low-memory textbfVLM.<n>Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z)
AdaFV: Rethinking of Visual-Language alignment for VLM acceleration [7.9213473377478865]
Some approaches to reduce the visual tokens according to the self-attention of VLMs, which are biased, result in inaccurate responses.<n>We propose a self-adaptive cross-modality attention mixture mechanism that dynamically leverages the effectiveness of visual saliency and text-to-image similarity.<n>The proposed approach achieves state-of-the-art training-free VLM acceleration performance, especially when the reduction rate is sufficiently large.
arXiv Detail & Related papers (2025-01-16T13:34:33Z)
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs [65.00970402080351]
A promising approach to accelerating large vision-language models (VLMs) is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens.<n>Our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM,
arXiv Detail & Related papers (2024-12-04T13:56:44Z)
Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.<n>To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image.<n>We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.