PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models
- URL: http://arxiv.org/abs/2410.07278v2
- Date: Mon, 02 Dec 2024 08:43:33 GMT
- Title: PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models
- Authors: Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li,
- Abstract summary: Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks.<n>But their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs.<n>We introduce PAR (Prompt-Aware Token Reduction), a novel and plug-and-play approach that reduces visual tokens efficiently without compromising model performance.
- Score: 32.33892531885448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs. To address this, we introduce PAR (Prompt-Aware Token Reduction), a novel and plug-and-play approach that reduces visual tokens efficiently without compromising model performance. Unlike previous methods that rely heavily on attention mechanisms and overlooking cross-modal interactions , we uses a prompt-aware strategy to adpative identify and cluster essential visual tokens. PAR categorizes visual context redundancy into two types: external and internal. External redundancy is minimized through semantic retrieval, while internal redundancy is addressed using a token routing mechanism. This method substantially reduces computational load without requiring additional training or complex architectural modifications. \textbf{Experimental results demonstrate that across various visual question answering tasks, PAR reduces FLOPs by 83\% with a compression ratio of 89\%, while retaining 97\% of baseline accuracy.} The adaptive design of PAR achieves a 2x token reduction ratio compared to prior approaches, enabling a better balance between performance and efficiency.
Related papers
- DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs)
Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity.
Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation [158.37640586809187]
Restoring any degraded image efficiently via just one model has become increasingly significant.
Our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations.
To fuse the degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed.
arXiv Detail & Related papers (2025-04-19T09:54:46Z) - Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping [13.846838416902575]
A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding.
We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models.
Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%.
arXiv Detail & Related papers (2025-03-26T04:16:48Z) - TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce textbfTopV, a compatible textbfTOken textbfPruning with inference Time Optimization for fast and low-memory textbfVLM.
Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z) - Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [49.77734021302196]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework.
To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features.
Results show that TOFC achieves up to 60% reduction in data transmission overhead and 50% reduction in system latency.
arXiv Detail & Related papers (2025-03-17T08:37:22Z) - Learning Free Token Reduction for Multi-Modal Large Language Models [3.4026156483879517]
Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks.
However, their practical deployment is often constrained by high computational costs and prolonged inference times.
We propose a token compression paradigm that operates on both spatial and temporal dimensions.
arXiv Detail & Related papers (2025-01-29T02:52:32Z) - FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance [9.782362715017596]
We introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence.
We analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy.
FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.
arXiv Detail & Related papers (2025-01-05T03:28:45Z) - Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption.
compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs.
We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z) - FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression [45.37530855889661]
High-resolution images lead to a quadratic increase in the number of visual tokens input into Multi-modal Large Language Models.
Current work develop visual token compression methods to achieve efficiency improvements, often at the expense of performance.
We build a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions.
arXiv Detail & Related papers (2024-11-21T15:37:52Z) - Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See [37.7015406019386]
Multimodal Large Language Models (MLLMs) treat visual tokens from visual encoders as text tokens.
As token counts grow, the quadratic scaling of computation in LLMs introduces an efficiency bottleneck.
In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA.
arXiv Detail & Related papers (2024-10-08T16:13:24Z) - Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction [6.467840081978855]
multimodal large language models (MM-LLMs) have achieved great success in many multimodal tasks, but their high computational costs limit their further promotion and application.
We studied the visual tokens of MM-LLMs and designed a dynamic pruning algorithm to address this issue.
Our proposed method can achieve performance that competes with the original performance when using an average of 22% of the original token quantity.
arXiv Detail & Related papers (2024-09-02T10:49:10Z) - ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual referring into Multimodal Large Language Models (MLLMs)
We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them.
We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map.
arXiv Detail & Related papers (2024-07-31T11:40:29Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens [21.61634020256455]
Transformer-based large language models (LLMs) suffer a performance degradation when modeling long-term contexts.
We propose a simple yet effective method to enable LLMs to take a deep breath, encouraging them to summarize information contained within discrete text chunks.
arXiv Detail & Related papers (2024-06-16T15:50:10Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
Our approach is inspired by two intriguing phenomena we have observed.
Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z) - LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model.
Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly.
We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z) - Identifying and Analyzing Task-Encoding Tokens in Large Language Models [55.03191279766383]
In this paper, we identify and analyze task-encoding tokens on whose representations the task performance depends.
We show that template and stopword tokens are the most prone to be task-encoding.
Our work sheds light on how large language models (LLMs) learn to perform a task from demonstrations, deepens our understanding of the varied roles different types of tokens play in LLMs, and provides insights for avoiding instability from improperly utilizing task-encoding tokens.
arXiv Detail & Related papers (2024-01-20T20:55:21Z) - Revisiting Multimodal Representation in Contrastive Learning: From Patch
and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We propose a learning-based vision-language pre-training approach, such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z) - ParaFormer: Parallel Attention Transformer for Efficient Feature
Matching [8.552303361149612]
This paper proposes a novel parallel attention model entitled ParaFormer.
It fuses features and keypoint positions through the concept of amplitude and phase, and integrates self- and cross-attention in a parallel manner.
Experiments on various applications, including homography estimation, pose estimation, and image matching, demonstrate that ParaFormer achieves state-of-the-art performance.
The efficient ParaFormer-U variant achieves comparable performance with less than 50% FLOPs of the existing attention-based models.
arXiv Detail & Related papers (2023-03-02T03:29:16Z) - Visually-augmented pretrained language models for NLP tasks without
images [77.74849855049523]
Existing solutions often rely on explicit images for visual knowledge augmentation.
We propose a novel textbfVisually-textbfAugmented fine-tuning approach.
Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales.
arXiv Detail & Related papers (2022-12-15T16:13:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.