TokenPacker: Efficient Visual Projector for Multimodal LLM
- URL: http://arxiv.org/abs/2407.02392v4
- Date: Wed, 28 Aug 2024 08:49:57 GMT
- Title: TokenPacker: Efficient Visual Projector for Multimodal LLM
- Authors: Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang,
- Abstract summary: The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM)
We propose a novel visual projector, which adopts a coarse-to-fine scheme to inject the enriched characteristics to generate the condensed visual tokens.
Our approach compresses the visual tokens by 75%89%, while achieves comparable or even better performance across diverse benchmarks.
- Score: 37.1071749188282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation. However, the visual tokens are redundant and can be considerably increased when dealing with high-resolution images, impairing the efficiency of MLLMs significantly. Some recent works have introduced resampler or abstractor to reduce the number of resulting visual tokens. Unfortunately, they fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector, which adopts a coarse-to-fine scheme to inject the enriched characteristics to generate the condensed visual tokens. In specific, we first interpolate the visual features as a low-resolution point query, providing the overall visual representation as the foundation. Then, we introduce a region-to-point injection module that utilizes high-resolution, multi-level region-based cues as fine-grained reference keys and values, allowing them to be fully absorbed within the corresponding local context region. This step effectively updates the coarse point query, transforming it into an enriched one for the subsequent LLM reasoning. Extensive experiments demonstrate that our approach compresses the visual tokens by 75%~89%, while achieves comparable or even better performance across diverse benchmarks with significantly higher efficiency. The source codes can be found at https://github.com/CircleRadon/TokenPacker.
Related papers
- FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z) - Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high latency during inference due to substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures [5.513631883813244]
We propose a framework that textbfPre-textbfIntegratestextbfPrompt information into the visual encoding process using existingmodules of MLLMs.
Our model maintains excellent generation even when half of the visual tokens are reduced.
arXiv Detail & Related papers (2024-10-30T15:05:17Z) - Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation [10.468784974994465]
The projector plays a crucial role in multi-modal language models (MLLMs)
Current explorations on the projector focus on reducing the number of visual tokens to improve efficiency.
A Spatial-Aware Efficient Projector (SAEP) is proposed to address this issue.
arXiv Detail & Related papers (2024-10-14T09:25:09Z) - Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See [37.7015406019386]
Multimodal Large Language Models (MLLMs) treat visual tokens from visual encoders as text tokens.
As token counts grow, the quadratic scaling of computation in LLMs introduces an efficiency bottleneck.
In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA.
arXiv Detail & Related papers (2024-10-08T16:13:24Z) - Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction [6.467840081978855]
multimodal large language models (MM-LLMs) have achieved great success in many multimodal tasks, but their high computational costs limit their further promotion and application.
We studied the visual tokens of MM-LLMs and designed a dynamic pruning algorithm to address this issue.
Our proposed method can achieve performance that competes with the original performance when using an average of 22% of the original token quantity.
arXiv Detail & Related papers (2024-09-02T10:49:10Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
Our approach is inspired by two intriguing phenomena we have observed.
Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z) - Auto-Encoding Morph-Tokens for Multimodal LLM [151.2618346912529]
We propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing MLLM to generate texts.
Experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously.
arXiv Detail & Related papers (2024-05-03T08:43:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.