Retrieval Replace Reduction: An effective visual token reduction method via semantic match
- URL: http://arxiv.org/abs/2410.07278v1
- Date: Wed, 9 Oct 2024 07:13:22 GMT
- Title: Retrieval Replace Reduction: An effective visual token reduction method via semantic match
- Authors: Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li,
- Abstract summary: We introduce textbfTRSM (textbfToken textbfReduction via textbfSemantic textbfMatch), which effectively reduces the number of visual tokens without compromising MLLM performance.
Inspired by how humans process multimodal tasks, TRSM leverages semantic information from one modality to match relevant semantics in another, reducing the number of visual tokens.
Based on experimental results, our approach compresses the visual tokens by 20%, achieving comparable performance across diverse visual question-answering and reasoning tasks.
- Score: 32.33892531885448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have demonstrated strong performance across various tasks without requiring training from scratch. However, they face significant computational and memory constraints, particularly when processing multimodal inputs that exceed context length, limiting their scalability. In this paper, we introduce a new approach, \textbf{TRSM} (\textbf{T}oken \textbf{R}eduction via \textbf{S}emantic \textbf{M}atch), which effectively reduces the number of visual tokens without compromising MLLM performance. Inspired by how humans process multimodal tasks, TRSM leverages semantic information from one modality to match relevant semantics in another, reducing the number of visual tokens.Specifically, to retain task relevant visual tokens, we use the text prompt as a query vector to retrieve the most similar vectors from the visual prompt and merge them with the text tokens. Based on experimental results, when applied to LLaVA-1.5\cite{liu2023}, our approach compresses the visual tokens by 20\%, achieving comparable performance across diverse visual question-answering and reasoning tasks.
Related papers
- Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product [8.014705094248589]
Low- parameters prompt tuning method outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.
Experiments across six architectures and eight datasets demonstrate that LAMP outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.
arXiv Detail & Related papers (2025-02-16T05:50:12Z) - Learning Free Token Reduction for Multi-Modal LLM [3.4026156483879517]
Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks.
However, their practical deployment is often constrained by high computational costs and prolonged inference times.
We propose a token compression paradigm that operates on both spatial and temporal dimensions.
arXiv Detail & Related papers (2025-01-29T02:52:32Z) - FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance [7.889590793589825]
We introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence.
We analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy.
FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.
arXiv Detail & Related papers (2025-01-05T03:28:45Z) - Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption.
compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs.
We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z) - FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression [45.37530855889661]
High-resolution images lead to a quadratic increase in the number of visual tokens input into Multi-modal Large Language Models.
Current work develop visual token compression methods to achieve efficiency improvements, often at the expense of performance.
We build a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions.
arXiv Detail & Related papers (2024-11-21T15:37:52Z) - Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference.
We mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into optimal quantization strategy.
Experimental results demonstrate that our method compresses the memory by 2.78x and increase generate speed by 1.44x about 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z) - Restore Anything Model via Efficient Degradation Adaptation [129.38475243424563]
RAM takes a unified path that leverages inherent similarities across various degradations to enable efficient and comprehensive restoration.
RAM's SOTA performance confirms RAM's SOTA performance, reducing model complexity by approximately 82% in trainable parameters and 85% in FLOPs.
arXiv Detail & Related papers (2024-07-18T10:26:53Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR)
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z) - ParaFormer: Parallel Attention Transformer for Efficient Feature
Matching [8.552303361149612]
This paper proposes a novel parallel attention model entitled ParaFormer.
It fuses features and keypoint positions through the concept of amplitude and phase, and integrates self- and cross-attention in a parallel manner.
Experiments on various applications, including homography estimation, pose estimation, and image matching, demonstrate that ParaFormer achieves state-of-the-art performance.
The efficient ParaFormer-U variant achieves comparable performance with less than 50% FLOPs of the existing attention-based models.
arXiv Detail & Related papers (2023-03-02T03:29:16Z) - Correlation Information Bottleneck: Towards Adapting Pretrained
Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.