Related papers: Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration

URL: http://arxiv.org/abs/2501.05179v3
Date: Sun, 16 Feb 2025 18:33:57 GMT
Title: Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration
Authors: Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen
Abstract summary: Multimodal large language models (MLLMs) have attracted considerable attention due to their exceptional performance in visual content understanding and reasoning. Token compression techniques, which reduce the number of visual tokens, have demonstrated their effectiveness in reducing computational costs. We propose a novel token compression method, GlobalCom$^2$, tailored for high-resolution MLLMs.
Score: 28.311125014789905
License:
Abstract: Multimodal large language models (MLLMs) have attracted considerable attention due to their exceptional performance in visual content understanding and reasoning. However, their inference efficiency has been a notable concern, as the increasing length of multimodal contexts leads to quadratic complexity. Token compression techniques, which reduce the number of visual tokens, have demonstrated their effectiveness in reducing computational costs. Yet, these approaches have struggled to keep pace with the rapid advancements in MLLMs, especially the AnyRes strategy in the context of high-resolution image understanding. In this paper, we propose a novel token compression method, GlobalCom$^2$, tailored for high-resolution MLLMs that receive both the thumbnail and multiple crops. GlobalCom$^2$ treats the tokens derived from the thumbnail as the "commander" of the entire token compression process, directing the allocation of retention ratios and the specific compression for each crop. In this way, redundant tokens are eliminated while important local details are adaptively preserved to the highest extent feasible. Empirical results across 10 benchmarks reveal that GlobalCom$^2$ achieves an optimal balance between performance and efficiency, and consistently outperforms state-of-the-art token compression methods with LLaVA-NeXT-7B/13B models. Our code is released at https://github.com/xuyang-liu16/GlobalCom2.
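The abstract's idea of letting thumbnail tokens direct per-crop retention ratios can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `allocate_retention` and `keep_top_tokens`, the use of summed importance per region, and the linear normalization are all assumptions made for illustration. The released code may score and normalize differently.

```python
# Hypothetical sketch: thumbnail importance scores decide how many tokens
# each crop may keep. Names and the normalization rule are assumptions.

def allocate_retention(thumb_scores, grid_rows, grid_cols, base_ratio):
    """Split the thumbnail score map into grid_rows x grid_cols regions
    (one per crop) and give each crop a retention ratio proportional to
    its region's total importance. The mean ratio equals base_ratio
    unless a ratio is clipped at 1.0."""
    h, w = len(thumb_scores), len(thumb_scores[0])
    rh, cw = h // grid_rows, w // grid_cols
    region_sums = []
    for r in range(grid_rows):
        for c in range(grid_cols):
            region_sums.append(sum(
                thumb_scores[i][j]
                for i in range(r * rh, (r + 1) * rh)
                for j in range(c * cw, (c + 1) * cw)
            ))
    n = len(region_sums)
    total = sum(region_sums)
    # Fall back to a uniform split when the thumbnail carries no signal.
    weights = [s / total for s in region_sums] if total > 0 else [1.0 / n] * n
    ratios = [min(1.0, base_ratio * n * wt) for wt in weights]
    return [ratios[r * grid_cols:(r + 1) * grid_cols] for r in range(grid_rows)]


def keep_top_tokens(token_scores, ratio):
    """Return the indices (in original order) of the highest-scoring
    tokens of one crop, keeping at least one token."""
    k = max(1, round(len(token_scores) * ratio))
    order = sorted(range(len(token_scores)),
                   key=lambda i: token_scores[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    # A 4x4 thumbnail score map covering a 2x2 layout of crops.
    scores = [
        [4, 4, 1, 1],
        [4, 4, 1, 1],
        [1, 1, 2, 2],
        [1, 1, 2, 2],
    ]
    print(allocate_retention(scores, 2, 2, base_ratio=0.25))
    print(keep_top_tokens([0.1, 0.9, 0.3, 0.7], ratio=0.5))
```

In this toy layout the top-left crop, whose thumbnail region holds half of the total importance, receives twice the average budget, while the two low-importance crops are compressed more aggressively.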

Related papers

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone. The number of vision tokens increases quadratically with image resolution, leading to huge computational costs. We propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep.
arXiv Detail & Related papers (2024-11-30T18:54:32Z)
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding [11.211803499867639]
We propose DYTO, a novel dynamic token merging framework for zero-shot video understanding. DYTO integrates a hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences. Experiments demonstrate the effectiveness of DYTO, achieving superior performance compared to both fine-tuned and training-free methods.
arXiv Detail & Related papers (2024-11-21T18:30:11Z)
Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. VLMs are often constrained by high latency during inference due to substantial compute required to process the large number of input tokens. We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid [87.09900996643516]
We introduce a Complementary Image Pyramid (CIP) to mitigate semantic discontinuity during high-resolution image processing. We also introduce a Scale Compression Mechanism (SCM) to reduce the additional computational overhead by compressing the redundant visual tokens. Our experiments demonstrate that CIP can consistently enhance the performance across diverse architectures.
arXiv Detail & Related papers (2024-08-04T13:55:58Z)
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally. This neglects their different informativeness and leads to a significant increase in the number of image tokens. We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
VoCo-LLaMA: Towards Vision Compression with Large Language Models [56.20788367278211]
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. Our method achieves minimal performance loss with a compression ratio of 576$\times$, resulting in up to 94.8% fewer FLOPs and 69.6% acceleration in inference time.
arXiv Detail & Related papers (2024-06-18T05:05:12Z)
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models [44.437693135170576]
We propose a new framework, LMM with Sophisticated Tasks, Local image compression, and Mixture of global Experts (SliME). We extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks. The proposed method achieves leading performance across various benchmarks with only 2 million training data.
arXiv Detail & Related papers (2024-06-12T17:59:49Z)
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity. We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z)
Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
The Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK) aims to redefine the evaluation protocol for compressed Large Language Models. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods. LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.