Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
- URL: http://arxiv.org/abs/2409.01179v1
- Date: Mon, 2 Sep 2024 11:19:54 GMT
- Title: Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
- Authors: Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu
- Abstract summary: We propose a text information-guided dynamic visual token recovery mechanism that does not require training.
Our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity.
- Score: 41.50379737105869
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advancement of large-scale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated exceptional performance in various visual tasks. Most of the current large-scale multimodal models achieve this by mapping visual features obtained from the visual encoder into a large language model and using them as inputs alongside text for downstream tasks. Therefore, the number of visual tokens directly affects the training and inference speed of the model. There has been significant work on token pruning for visual transformers, but for large multimodal models, only relying on visual information for token pruning or compression may lead to significant loss of important information. On the other hand, the textual input in the form of a question may contain valuable information that can aid in answering the question, providing additional knowledge to the model. To address the potential oversimplification and excessive pruning that can occur with most purely visual token pruning methods, we propose a text information-guided dynamic visual token recovery mechanism that does not require training. This mechanism leverages the similarity between the question text and visual tokens to recover visually meaningful tokens with important text information while merging other less important tokens. Experimental results demonstrate that our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity. Our source code will be made publicly available following acceptance.
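As a rough, hedged illustration of the mechanism described in the abstract (the authors' code is not yet released), the sketch below keeps the visual tokens most similar to the question text and merges the rest into a single averaged token. The cosine similarity measure, the 10% keep ratio, and the tensor shapes are assumptions chosen for clarity, not the paper's exact procedure.
```python
# Minimal sketch of text-guided visual token recovery and merging
# (illustrative only; not the authors' released implementation).
import torch
import torch.nn.functional as F

def recover_and_merge(visual_tokens: torch.Tensor,
                      text_tokens: torch.Tensor,
                      keep_ratio: float = 0.1) -> torch.Tensor:
    """visual_tokens: (Nv, D) projected visual features;
    text_tokens: (Nt, D) question-text embeddings in the same space.
    Keeps the visual tokens most similar to the question text and
    merges the remainder into one averaged token."""
    # Cosine similarity of every visual token to every text token.
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    score = (v @ t.t()).max(dim=-1).values   # best-matching text token per visual token

    n_keep = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep_mask = torch.zeros(visual_tokens.size(0), dtype=torch.bool)
    keep_mask[score.topk(n_keep).indices] = True

    kept = visual_tokens[keep_mask]
    merged = visual_tokens[~keep_mask].mean(dim=0, keepdim=True)  # merge the rest
    return torch.cat([kept, merged], dim=0)

# Example: 576 visual tokens and 32 question tokens in a 4096-dim LLM space.
out = recover_and_merge(torch.randn(576, 4096), torch.randn(32, 4096))
print(out.shape)  # torch.Size([58, 4096]) -- about 10% of the original tokens
```
Under these assumptions, the kept tokens plus the merged token would stand in for the full visual sequence fed to the language model, consistent with the roughly 10% compression reported in the abstract.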
Related papers
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities [54.344077285545005]
We focus on improving visual understanding capabilities to boost vision-language models.
We propose Arcana, a multimodal language model that introduces two crucial techniques.
arXiv Detail & Related papers (2024-10-17T16:36:38Z)
- From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities [31.108694010274988]
We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding.
Our method directly incorporates structural prior information into image tokens, mirroring the successful tokenization strategies used in text-only Large Language Models.
arXiv Detail & Related papers (2024-10-03T02:34:31Z)
- Visual Grounding with Multi-modal Conditional Adaptation [14.177510695317098]
Visual grounding is the task of locating objects specified by natural language expressions.
We introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update weights.
MMCA achieves significant improvements and state-of-the-art results.
arXiv Detail & Related papers (2024-09-08T07:08:58Z)
- AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008 \times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
- LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression [23.966237939194514]
We present a study of visual-token redundancy and efficient training in large multimodal models.
Our initial experiments show that eliminating up to 70% of visual tokens at test time by simple average pooling leads to only a minimal 3% drop in visual question answering accuracy (a minimal pooling sketch appears after this list).
We introduce visual context compression, which reduces the number of visual tokens during training to enhance training efficiency without sacrificing performance.
arXiv Detail & Related papers (2024-06-28T17:57:14Z)
- Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model.
Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly.
We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps visual features to probability distributions over the large multimodal model's vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller number of image-text pairs.
Our model has unique properties; most notably, a new version with updated training samples can be deployed in a matter of seconds.
arXiv Detail & Related papers (2022-10-04T16:56:22Z)
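The average-pooling observation cited in the LLaVolta entry above can be illustrated with a short sketch; the pooling stride and token count below are assumptions for illustration, not that paper's exact configuration.
```python
# Minimal sketch of visual-token reduction by average pooling (illustrating
# the LLaVolta observation above; stride and token count are assumptions).
import torch
import torch.nn.functional as F

def pool_visual_tokens(visual_tokens: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """visual_tokens: (N, D). Averages every `stride` consecutive tokens,
    shrinking the sequence by roughly (1 - 1/stride), e.g. 75% for stride=4."""
    x = visual_tokens.t().unsqueeze(0)                 # (1, D, N) for 1-D pooling
    pooled = F.avg_pool1d(x, kernel_size=stride, stride=stride, ceil_mode=True)
    return pooled.squeeze(0).t()                       # (ceil(N/stride), D)

tokens = torch.randn(576, 4096)                        # e.g. a 24x24 ViT patch grid
print(pool_visual_tokens(tokens).shape)                # torch.Size([144, 4096])
```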
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.