Related papers: Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning

Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning

URL: http://arxiv.org/abs/2505.11945v2
Date: Thu, 22 May 2025 15:10:59 GMT
Title: Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning
Authors: Bonan li, Zicheng Zhang, Songhua Liu, Weihao Yu, Xinchao Wang,
Abstract summary: Visual instruction tuning aims to enable large language models to comprehend the visual world.<n>Existing methods often grapple with the intractable trade-off between accuracy and efficiency.<n>We present LLaVA-Meteor, a novel approach that strategically compresses visual tokens without compromising core information.
Score: 70.57180215148125
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual instruction tuning aims to enable large language models to comprehend the visual world, with a pivotal challenge lying in establishing an effective vision-to-language projection. However, existing methods often grapple with the intractable trade-off between accuracy and efficiency. In this paper, we present LLaVA-Meteor, a novel approach designed to break this deadlock, equipped with a novel Top-Down Compression paradigm that strategically compresses visual tokens without compromising core information. Specifically, we construct a trainable Flash Global Fusion module based on efficient selective state space operators, which aligns the feature space while enabling each token to perceive holistic visual context and instruction preference at low cost. Furthermore, a local-to-single scanning manner is employed to effectively capture local dependencies, thereby enhancing the model's capability in vision modeling. To alleviate computational overhead, we explore a Visual-Native Selection mechanism that independently assesses token significance by both the visual and native experts, followed by aggregation to retain the most critical subset. Extensive experiments show that our approach reduces visual tokens by 75--95% while achieving comparable or superior performance across 12 benchmarks, significantly improving efficiency.

Related papers

ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning [8.933549837045932]
Large Vision-Language Models incur high computational costs due to significant redundancy in their visual tokens.<n>We propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the Large Language Models.
arXiv Detail & Related papers (2026-01-25T12:47:30Z)
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition [21.989922235835067]
AdaptVision is an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach.<n>We show that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
arXiv Detail & Related papers (2025-12-03T13:43:30Z)
A Simple and Better Baseline for Visual Grounding [41.76403278559263]
We propose a feature selection-based simple yet effective baseline for visual grounding, called FSVG.<n>Specifically, we encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures.<n>We introduce a similarity-based feature selection mechanism to only exploit language-related visual features for faster prediction.
arXiv Detail & Related papers (2025-10-12T13:06:59Z)
Infusing fine-grained visual knowledge to Vision-Language Models [5.487134463783365]
Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs)<n>We propose a fine-tuning method explicitly designed to achieve optimal balance between fine-grained domain adaptation and retention of the pretrained VLM's broad multimodal knowledge.<n>Our approach consistently achieves strong results, notably retaining the visual-text alignment without utilizing any text data or the original text encoder during fine-tuning.
arXiv Detail & Related papers (2025-08-16T19:12:09Z)
End-to-End Vision Tokenizer Tuning [73.3065542220568]
The vision tokenizer optimized for low-level reconstruction is to downstream tasks requiring varied representations and semantics.<n>The loss of the vision tokenization can be the representation bottleneck for target tasks.<n>We propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks.
arXiv Detail & Related papers (2025-05-15T17:59:39Z)
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce textbfTopV, a compatible textbfTOken textbfPruning with inference Time Optimization for fast and low-memory textbfVLM.<n>Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z)
Learning Free Token Reduction for Multi-Modal Large Language Models [3.4026156483879517]
Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks.<n>However, their practical deployment is often constrained by high computational costs and prolonged inference times.<n>We propose a token compression paradigm that operates on both spatial and temporal dimensions.
arXiv Detail & Related papers (2025-01-29T02:52:32Z)
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression [45.37530855889661]
High-resolution images lead to a quadratic increase in the number of visual tokens input into Multi-modal Large Language Models. Current work develop visual token compression methods to achieve efficiency improvements, often at the expense of performance. We build a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions.
arXiv Detail & Related papers (2024-11-21T15:37:52Z)
Adaptive Masking Enhances Visual Grounding [12.793586888511978]
We propose IMAGE, Interpretative MAsking with Gaussian radiation modEling, to enhance vocabulary grounding in low-shot learning scenarios. We evaluate the efficacy of our approach on benchmark datasets, including COCO and ODinW, demonstrating its superior performance in zero-shot and few-shot tasks.
arXiv Detail & Related papers (2024-10-04T05:48:02Z)
Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image. We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
arXiv Detail & Related papers (2024-05-23T14:30:33Z)
VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning. VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence. On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning [19.28860833813788]
Existing models commonly train a visual encoder with weak cross-modal supervision signals. We propose a novel textbfVisually-textbfAsymmetric cotextbfNsistentextbfCy textbfLearning (textscVancl) approach to capture fine-grained visual and layout features.
arXiv Detail & Related papers (2023-10-23T10:37:22Z)
Heterogeneous Contrastive Learning: Encoding Spatial Information for Compact Visual Representations [183.03278932562438]
This paper presents an effective approach that adds spatial information to the encoding stage to alleviate the learning inconsistency between the contrastive objective and strong data augmentation operations. We show that our approach achieves higher efficiency in visual representations and thus delivers a key message to inspire the future research of self-supervised visual representation learning.
arXiv Detail & Related papers (2020-11-19T16:26:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.