Related papers: Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

URL: http://arxiv.org/abs/2409.01179v3
Date: Thu, 19 Dec 2024 06:26:04 GMT
Title: Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
Authors: Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu,
Abstract summary: We propose a text information-guided dynamic visual token recovery mechanism that does not require training.<n>Our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity.
Score: 41.50379737105869
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the advancement of large-scale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated exceptional performance in various visual tasks. Most of the current large-scale multimodal models achieve this by mapping visual features obtained from the visual encoder into a large language model and using them as inputs alongside text for downstream tasks. Therefore, the number of visual tokens directly affects the training and inference speed of the model. There has been significant work on token pruning for visual transformers, but for large multimodal models, only relying on visual information for token pruning or compression may lead to significant loss of important information. On the other hand, the textual input in the form of a question may contain valuable information that can aid in answering the question, providing additional knowledge to the model. To address the potential oversimplification and excessive pruning that can occur with most purely visual token pruning methods, we propose a text information-guided dynamic visual token recovery mechanism that does not require training. This mechanism leverages the similarity between the question text and visual tokens to recover visually meaningful tokens with important text information while merging other less important tokens. Experimental results demonstrate that our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity. Our source code will be made publicly available following acceptance.

Related papers

ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models [7.7352936204066]
We propose a novel, zero-shot method to model visual token pruning as a balance between task relevance and information diversity.<n>Our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss.<n>These gains are accompanied by significant reductions in GPU memory footprint and inference latency.
arXiv Detail & Related papers (2025-10-20T06:18:47Z)
CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models [75.88232735646018]
Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos.<n>Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations.<n>We propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM.
arXiv Detail & Related papers (2025-08-24T07:47:00Z)
Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics.<n>Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs)<n>We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
arXiv Detail & Related papers (2025-06-27T14:55:40Z)
ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework.<n>It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z)
BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries [37.37905881898424]
multimodal large language models (MLLMs) eliminate the need for a well-trained vision encoder by directly processing image tokens before the language model. The absence of a vision encoder implies that the model is likely to rely on substantial data to learn the necessary visual-semantic alignments. We present BREEN, a data-efficient encoder-free multimodal architecture that mitigates this issue.
arXiv Detail & Related papers (2025-03-16T10:43:14Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models [29.611769371733672]
We propose De Attention (D-Attn), a novel method that processes visual and textual embeddings differently. D-Attn diagonalizes visual-to-visual self-attention, reducing computation from $mathcalO(|V|2)$ to $mathcalO(|V|)$ for $|V|$ visual embeddings without compromising performance.
arXiv Detail & Related papers (2025-02-04T00:46:11Z)
Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption. compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
Improving Multi-modal Large Language Model through Boosting Vision Capabilities [54.344077285545005]
We focus on improving the visual understanding capability for boosting the vision-language models. We propose textbfArcana, a multiModal language model, which introduces two crucial techniques.
arXiv Detail & Related papers (2024-10-17T16:36:38Z)
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding [39.68348330596116]
We propose modelname, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs) Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features. modelnameachieves significant improvements in visual representation and benchmark performance.
arXiv Detail & Related papers (2024-10-15T17:55:22Z)
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities [31.108694010274988]
We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair. Our method directly incorporates structural prior information into image tokens, mirroring the successful tokenization strategies used in text-only Large Language Models.
arXiv Detail & Related papers (2024-10-03T02:34:31Z)
Visual Grounding with Multi-modal Conditional Adaptation [14.177510695317098]
Visual grounding is the task of locating objects specified by natural language expressions. We introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update weights. MMCA achieves significant improvements and state-of-the-art results.
arXiv Detail & Related papers (2024-09-08T07:08:58Z)
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. Our model is capable of processing images with resolutions up to $1008times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens. We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary. We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller amount of image-text pairs. Our model has unique properties, most notably, deploying a new version with updated training samples can be done in a matter of seconds.
arXiv Detail & Related papers (2022-10-04T16:56:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.