Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation
- URL: http://arxiv.org/abs/2412.09817v1
- Date: Fri, 13 Dec 2024 03:13:44 GMT
- Title: Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation
- Authors: Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan, Zheng Hui, Jiawei Yao,
- Abstract summary: The interpretability of LVLMs remains an under-explored area.
In models such as LLaVA1.5, image tokens that are semantically related to text are more likely to have information flow convergence.
We propose a new image token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs.
- Score: 7.742746565876165
- License:
- Abstract: Multimodal large language models have experienced rapid growth, and numerous different models have emerged. The interpretability of LVLMs remains an under-explored area. Especially when faced with more complex tasks such as chain-of-thought reasoning, their internal mechanisms still resemble a black box that is difficult to decipher. By studying the interaction and information flow between images and text, we noticed that in models such as LLaVA1.5, image tokens that are semantically related to text are more likely to have information flow convergence in the LLM decoding layer, and these image tokens receive higher attention scores. However, those image tokens that are less relevant to the text do not have information flow convergence, and they receive only very small attention scores. To efficiently utilize the image information, we propose a new image token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs by computing the similarity between image and text embeddings and ignoring image tokens that are irrelevant and unimportant to the text. Through extensive experiments, we demonstrate the effectiveness of our method for complex reasoning tasks. The paper's source code is available at https://github.com/FanshuoZeng/Simignore.
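As a rough illustration of the mechanism the abstract describes, the sketch below scores each image token by its embedding similarity to the text tokens and discards the least similar ones before decoding. This is a minimal sketch, not the authors' implementation: the cosine-similarity scoring, the max-over-text-tokens aggregation, and the `keep_ratio` parameter are assumptions here; the repository linked above is the authoritative reference.

```python
import torch
import torch.nn.functional as F

def filter_image_tokens(image_embeds: torch.Tensor,
                        text_embeds: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the image tokens most similar to the text (illustrative).

    image_embeds: (num_image_tokens, dim) projected image-token embeddings
    text_embeds:  (num_text_tokens, dim) text-token embeddings
    Returns the retained image-token embeddings in their original order.
    """
    # Cosine similarity between every image token and every text token.
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    sim = img @ txt.T                       # (num_image_tokens, num_text_tokens)

    # Score each image token by its strongest alignment with any text token.
    scores = sim.max(dim=-1).values

    # Ignore the least text-relevant image tokens.
    k = max(1, int(keep_ratio * image_embeds.shape[0]))
    keep_idx = scores.topk(k).indices.sort().values
    return image_embeds[keep_idx]

# Example with random tensors standing in for LLaVA-1.5-sized features.
kept = filter_image_tokens(torch.randn(576, 4096), torch.randn(32, 4096))
print(kept.shape)                           # torch.Size([288, 4096])
```

Keeping the surviving tokens in their original order preserves the positional structure the LLM decoder expects.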
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- A New Method to Capturing Compositional Knowledge in Linguistic Space [0.0]
ZS-CU is a novel task that enhances compositional understanding without requiring hard negative training data.
We propose YUKINO, which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model.
YUKINO outperforms the existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark.
arXiv Detail & Related papers (2024-12-20T07:48:09Z)
- The Narrow Gate: Localized Image-Text Communication in Vision-Language Models [36.33608889682152]
We compare vision-language models (VLMs) that generate both images and text with those that output only text.
We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream.
In contrast, models trained for image and text generation rely on a single token that acts as a narrow gate for the visual information.
arXiv Detail & Related papers (2024-12-09T16:39:40Z)
- LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [71.04947115945349]
We relabel the data with long captions; however, training directly on them can degrade performance on short-text understanding.
Our method then helps the model recover its original level of short-text understanding while greatly enhancing its long-text understanding.
Our method demonstrates superior performance in long-text-image retrieval tasks.
arXiv Detail & Related papers (2024-10-07T17:52:56Z)
- AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to 1008×1008.
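The summary gives only the broad idea: the number of visual tokens grows with the image's size and aspect ratio, up to 1008×1008 pixels. The snippet below is a rough sketch of what such a partitioning rule could look like, not AdaptVision's actual module; the 336-pixel tile size, the three-tile cap per side, and the function name are assumptions.

```python
import math

def plan_tile_grid(width: int, height: int,
                   tile: int = 336, max_tiles_per_side: int = 3) -> tuple[int, int]:
    """Pick a (cols, rows) tile grid for an image of the given size.

    Larger and more elongated images get more tiles, and therefore more
    visual tokens; the grid is capped at max_tiles_per_side per dimension
    (1008x1008 pixels with the defaults above).
    """
    cols = min(max_tiles_per_side, max(1, math.ceil(width / tile)))
    rows = min(max_tiles_per_side, max(1, math.ceil(height / tile)))
    return cols, rows

# A 800x600 photo -> (3, 2) grid, i.e. 6 tiles; a 336x336 icon -> (1, 1).
print(plan_tile_grid(800, 600), plan_tile_grid(336, 336))
```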
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
- Chain of Images for Intuitively Reasoning [23.692458865558486]
We present a Chain of Images (CoI) approach to convert complex language reasoning problems to simple pattern recognition.
We have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving.
In supporting our CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions.
arXiv Detail & Related papers (2023-11-09T11:14:51Z)
- PuMer: Pruning and Merging Tokens for Efficient Vision Language Models [41.81484883647005]
PuMer is a framework that uses text-informed Pruning and modality-aware Merging strategies to progressively reduce the number of input image and text tokens.
PuMer inference increases throughput by up to 2x and reduces memory footprint by over 50% while incurring less than a 1% accuracy drop.
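The summary names the two stages but not their mechanics. Purely as an illustration of the general shape of such a pipeline, and not PuMer's actual algorithm, the sketch below first prunes image tokens with low similarity to the text and then merges the most mutually redundant survivors; every threshold and name here is an assumption.

```python
import torch
import torch.nn.functional as F

def prune_then_merge(image_tokens: torch.Tensor,
                     text_tokens: torch.Tensor,
                     prune_ratio: float = 0.3,
                     merge_steps: int = 16) -> torch.Tensor:
    """Reduce image tokens in two illustrative stages.

    1. Text-informed pruning: drop tokens least similar to the text.
    2. Similarity-based merging: repeatedly average the most similar pair.
    """
    img = F.normalize(image_tokens, dim=-1)
    txt = F.normalize(text_tokens, dim=-1)

    # Stage 1: keep the image tokens most relevant to the text.
    relevance = (img @ txt.T).max(dim=-1).values
    keep = max(1, int(round((1 - prune_ratio) * image_tokens.shape[0])))
    tokens = image_tokens[relevance.topk(keep).indices.sort().values]

    # Stage 2: merge the most redundant remaining pairs by averaging.
    for _ in range(merge_steps):
        if tokens.shape[0] < 2:
            break
        t = F.normalize(tokens, dim=-1)
        sim = (t @ t.T).fill_diagonal_(-1.0)       # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        merged = (tokens[i] + tokens[j]) / 2
        keep_mask = torch.ones(tokens.shape[0], dtype=torch.bool)
        keep_mask[j] = False
        tokens = tokens[keep_mask]                  # drop token j ...
        tokens[i - (i > j)] = merged                # ... and overwrite token i with the merge
    return tokens
```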
arXiv Detail & Related papers (2023-05-27T17:16:27Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
STAIR significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
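The summary only says that image and text are mapped into a sparse token space. One generic way to build such a head, not necessarily what STAIR does, is to project a dense embedding onto a vocabulary-sized space, apply a non-negative saturating transform, and keep only the top-k entries; the sketch below assumes that design, with the vocabulary size, the `log1p(relu(...))` transform, and the top-k value all chosen for illustration.

```python
import torch
import torch.nn as nn

class SparseGroundedHead(nn.Module):
    """Illustrative head mapping a dense embedding to a sparse token-space vector."""

    def __init__(self, dim: int = 768, vocab_size: int = 30522, top_k: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, vocab_size)  # one logit per (sub)word token
        self.top_k = top_k

    def forward(self, dense: torch.Tensor) -> torch.Tensor:
        # Non-negative, saturating weights over the vocabulary.
        weights = torch.log1p(torch.relu(self.proj(dense)))
        # Keep only the top-k entries; the rest stay exactly zero.
        topk = weights.topk(self.top_k, dim=-1)
        sparse = torch.zeros_like(weights)
        return sparse.scatter(-1, topk.indices, topk.values)

# Separate image and text heads of this form could be trained contrastively,
# with retrieval scores computed as dot products of the sparse vectors.
```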
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
EPIC is easily combined with existing pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)