BlindSight: Harnessing Sparsity for Efficient VLMs
- URL: http://arxiv.org/abs/2507.09071v1
- Date: Fri, 11 Jul 2025 23:15:30 GMT
- Title: BlindSight: Harnessing Sparsity for Efficient VLMs
- Authors: Tharun Adithya Srikrishnan, Deval Shah, Steven K. Reinhardt,
- Abstract summary: We propose BlindSight: a training-free approach to optimize VLM inference using an input template-aware attention sparsity mask. BlindSight results in a 32%-41% reduction in FLOPs on average, with accuracy within -2% to +2% of the original model on most evaluated multi-image understanding benchmarks.
- Score: 4.756688231351083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (VLMs) enable the joint processing of text and images. However, the inclusion of vision data significantly expands the prompt length, which, combined with the quadratic complexity of the attention computation, results in a longer prefill duration. One approach to mitigating this bottleneck is to leverage the inherent sparsity in the attention computation. In our analysis of attention patterns in VLMs, we observe that a substantial portion of layers exhibit minimal cross-image attention, except through attention-sink tokens per image. These sparse attention patterns fall into distinct categories: sink-only, document mask, and a hybrid document-sink mask. Based on this, we propose BlindSight: a training-free approach to optimize VLM inference using an input template-aware attention sparsity mask. We utilize samples from a dataset to derive a prompt-agnostic sparsity categorization for every attention head. We evaluate the proposed technique on VLMs such as Qwen2-VL, Qwen2.5-VL, and Gemma-3. BlindSight yields a 32%-41% reduction in FLOPs on average, with accuracy within -2% to +2% of the original model on most evaluated multi-image understanding benchmarks.
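To make the three sparsity categories concrete, the following PyTorch sketch shows one way such template-aware masks could be built from the per-image token spans of a prompt. The function name, its arguments, and the exact semantics assigned to each category are illustrative assumptions based only on the abstract, not the paper's implementation.

```python
import torch

def build_sparsity_mask(seq_len, image_spans, category, num_sinks=1):
    # Hypothetical sketch of the three per-head mask categories named in the
    # abstract (sink-only, document, hybrid document-sink); the definitions
    # used in the paper may differ. `image_spans` is a list of (start, end)
    # token ranges for each image; True marks attention entries to keep.
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()

    # Label each token with its image index (-1 for text tokens).
    image_id = torch.full((seq_len,), -1)
    for i, (s, e) in enumerate(image_spans):
        image_id[s:e] = i
    is_text = image_id == -1
    same_image = image_id[None, :] == image_id[:, None]

    # Treat the first `num_sinks` tokens of each image as its attention sinks.
    sink = torch.zeros(seq_len, dtype=torch.bool)
    for s, _ in image_spans:
        sink[s:s + num_sinks] = True

    # Document mask: image tokens attend only within their own image block;
    # text tokens keep dense causal attention.
    document = causal & (is_text[:, None] | is_text[None, :] | same_image)

    if category == "document":
        return document
    if category == "sink_only":
        # Image queries see text and per-image sink tokens only.
        return causal & (is_text[:, None] | is_text[None, :] | sink[None, :])
    if category == "document_sink":
        # Document mask plus globally visible sink tokens.
        return document | (causal & sink[None, :])
    raise ValueError(f"unknown category: {category}")
```

A head assigned one of these categories could then restrict its score computation to the True entries, for example by passing the boolean mask as the attn_mask argument of torch.nn.functional.scaled_dot_product_attention.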
Related papers
- FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering [5.840924060437216]
We propose a training-free visual cropping method, dubbed FOCUS, to guide the search for the most relevant image region. FOCUS achieves strong performance across four fine-grained VQA datasets and two types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye.
arXiv Detail & Related papers (2025-06-26T18:51:04Z) - High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution [87.56382172827526]
High-frequency regions are most critical for reconstruction. We propose a training-free adaptive masking module for acceleration. Our method reduces FLOPs by 24%-43% for state-of-the-art models.
arXiv Detail & Related papers (2025-05-11T13:18:03Z) - Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features [24.33252753245426]
We exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our model can reduce inference latency and memory usage while achieving benchmark parity.
arXiv Detail & Related papers (2025-04-01T09:10:32Z) - A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs [65.00970402080351]
A promising approach to accelerating large vision-language models (VLMs) is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. Our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM.
arXiv Detail & Related papers (2024-12-04T13:56:44Z) - Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption. Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z) - FilterViT and DropoutViT [0.0]
We introduce an enhanced version of ViT that conducts attention-based QKV operations during the initial stages of downsampling.
We propose a filter attention mechanism that uses a Filter Block to create a salient mask for selecting the most informative pixels for attention.
This approach effectively decreases the number of tokens involved in the attention, reducing computational complexity and boosting processing speed.
arXiv Detail & Related papers (2024-10-30T05:38:03Z) - AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity [85.44800864697464]
We introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction.
We show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference.
arXiv Detail & Related papers (2024-09-20T10:50:21Z) - Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z) - Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former baseline, with only up to 2% mIoU degradation on the Cityscapes validation set.
arXiv Detail & Related papers (2023-10-29T09:55:28Z)