Related papers: Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

URL: http://arxiv.org/abs/2410.06169v2
Date: Fri, 15 Nov 2024 18:43:23 GMT
Title: Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
Authors: Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, Chenliang Xu,
Abstract summary: Multimodal Large Language Models (MLLMs) treat visual tokens from visual encoders as text tokens. As token counts grow, the quadratic scaling of computation in LLMs introduces an efficiency bottleneck. In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA.
Score: 37.7015406019386
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: By treating visual tokens from visual encoders as text tokens, Multimodal Large Language Models (MLLMs) have achieved remarkable progress across diverse visual understanding tasks, leveraging the robust architectures of Large Language Models (LLMs). However, as token counts grow, the quadratic scaling of computation in LLMs introduces a significant efficiency bottleneck, impeding further scalability. Although recent approaches have explored pruning visual tokens or employing lighter LLM architectures, the computational overhead from an increasing number of visual tokens remains a substantial challenge. In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA, a representative MLLM, and introduce a suite of streamlined strategies to enhance efficiency. These include neighbor-aware visual token attention, pruning of inactive visual attention heads, and selective layer dropping for visual computations. By implementing these strategies in LLaVA, we achieve a reduction in computational demands of 88% while maintaining model performance across key benchmarks. Additionally, we validate the existence of visual computational redundancy in other MLLMs, such as Qwen2-VL-7B and InternVL-2.0-4B/8B/26B. These results present a novel pathway for MLLMs to handle dense visual tokens with minimal computational costs. Code and model checkpoints will be released to support further research.

Related papers

Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs [8.97780713904412]
This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in Large Vision-Language Models (LVLMs)<n>Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. Experiments on three LVLM benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead.
arXiv Detail & Related papers (2025-06-11T08:46:55Z)
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms [16.41418610688371]
We introduce CrossLMM, which substantially reduces visual token quantity with minimal performance degradation.<n>We also introduce a text-to-visual cross-attention mechanism, for which the text tokens are enhanced through interaction with the original visual tokens.<n>Our approach achieves comparable or superior performance across diverse video-based Large Language Models benchmarks.
arXiv Detail & Related papers (2025-05-22T17:59:53Z)
[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs [66.5266435598799]
Multi-language Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision tasks. However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements. We introduce a simple yet effective method for train-free visual compression, called VTC- compression.
arXiv Detail & Related papers (2024-12-08T05:29:39Z)
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
Excessive use of visual tokens in existing Multimoal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE) DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy [37.471419716572086]
There is a significant gap in instruction-following capabilities between Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) We propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap.
arXiv Detail & Related papers (2024-11-23T05:03:32Z)
Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. VLMs are often constrained by high latency during inference due to substantial compute required to process the large number of input tokens. We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction [6.467840081978855]
multimodal large language models (MM-LLMs) have achieved great success in many multimodal tasks, but their high computational costs limit their further promotion and application. We studied the visual tokens of MM-LLMs and designed a dynamic pruning algorithm to address this issue. Our proposed method can achieve performance that competes with the original performance when using an average of 22% of the original token quantity.
arXiv Detail & Related papers (2024-09-02T10:49:10Z)
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual referring into Multimodal Large Language Models (MLLMs) We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them. We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map.
arXiv Detail & Related papers (2024-07-31T11:40:29Z)
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs. This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language. This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok) SeTok groups visual features into semantic units via a dynamic clustering algorithm. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. Our approach is inspired by two intriguing phenomena we have observed. Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z)
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.