Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
- URL: http://arxiv.org/abs/2503.16036v1
- Date: Thu, 20 Mar 2025 11:09:18 GMT
- Title: Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
- Authors: Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie
- Abstract summary: We propose a Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom). We use the instruction as a condition to guide the compression from both local and global levels. Experiments show that our HICom can obtain distinguished video understanding ability with fewer tokens.
- Score: 36.16630765077807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, which is often alleviated through compression strategies. However, visual content does not contribute equally to user instructions, so existing strategies (e.g., average pooling) inevitably lose potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), which uses the instruction as a condition to guide the compression from both local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and the learnable tokens at the global level, and an attention mechanism then completes the conditional compression. With this hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is preserved, making the result easier for LLMs to understand. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that our HICom can obtain distinguished video understanding ability with fewer tokens, increasing performance by 2.43% on average on three multiple-choice QA benchmarks and saving 78.8% of tokens compared with the SOTA method. The code is available at https://github.com/lntzm/HICom.
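The contrast the abstract draws between average pooling and instruction-conditioned compression can be illustrated with a small toy sketch. This is not the authors' implementation (that lives in the linked repository). It assumes single-head dot-product attention with no learned projections, and the function names (`average_pool`, `local_conditional_pool`, `global_conditional_pool`) and the additive injection of the instruction into the query are hypothetical choices made only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine vectors using the given weights, one weight per vector."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def average_pool(tokens, group_size):
    """Baseline: each group of tokens becomes its unweighted mean."""
    pooled = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        pooled.append(weighted_sum([1.0 / len(group)] * len(group), group))
    return pooled


def local_conditional_pool(tokens, instruction, group_size):
    """Local level (assumed form): within each group, weight tokens by
    their scaled dot-product similarity to the instruction embedding."""
    scale = math.sqrt(len(instruction))
    pooled = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        weights = softmax([dot(t, instruction) / scale for t in group])
        pooled.append(weighted_sum(weights, group))
    return pooled


def global_conditional_pool(tokens, queries, instruction):
    """Global level (assumed form): each query token, shifted by the
    instruction embedding, attends over all visual tokens."""
    scale = math.sqrt(len(instruction))
    outputs = []
    for query in queries:
        conditioned = [q + c for q, c in zip(query, instruction)]
        weights = softmax([dot(t, conditioned) / scale for t in tokens])
        outputs.append(weighted_sum(weights, tokens))
    return outputs


# Toy data: four 2-d "visual tokens"; the instruction points along axis 0.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
instruction = [4.0, 0.0]
queries = [[0.0, 0.0]]  # stands in for one learnable global token

baseline = average_pool(tokens, group_size=2)
local = local_conditional_pool(tokens, instruction, group_size=2)
global_tokens = global_conditional_pool(tokens, queries, instruction)
compressed = local + global_tokens  # 4 input tokens become 3
```

In this toy setup, average pooling gives both token types equal weight in every group, whereas the conditioned variants shift most of the weight toward the tokens aligned with the instruction. Grouping keeps one output per local region, which is one simple way to keep the spatial-temporal ordering the abstract mentions.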
Related papers
- LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models [62.240460476785934]
We propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts.
arXiv Detail & Related papers (2025-07-03T03:42:54Z)
- LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs [23.801172170798132]
LLaVA-Scissor is a training-free token compression strategy designed for multimodal large language models. We propose to leverage the Semantic Connected Components (SCC) approach to ensure comprehensive semantic coverage. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks.
arXiv Detail & Related papers (2025-06-27T02:29:58Z)
- Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention [30.580674811560613]
Large Language Models (LLMs) encounter significant challenges in long-sequence inference due to computational inefficiency and redundant processing. Existing methods often rely on token importance to perform hard local compression or encode context into latent representations for soft global compression. We propose HyCo$_2$, which integrates both global and local perspectives to guide context compression.
arXiv Detail & Related papers (2025-05-21T17:26:11Z)
- Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [50.214593234229255]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
arXiv Detail & Related papers (2025-03-21T09:46:31Z)
- A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings.
Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features.
Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
- Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models [28.311125014789905]
"Global Compression Commander" (i.e., GlobalCom$^2$) is a novel plug-and-play token compression framework for HR-LVLMs. Our experiments show that GlobalCom$^2$ maintains over 90% performance while compressing 90% of visual tokens.
arXiv Detail & Related papers (2025-01-09T11:57:58Z)
- ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding [55.320254859515714]
ReTaKe enables VideoLLMs to process 8 times longer frames (up to 2048), outperforming similar-sized models by 3-5% and even rivaling much larger ones on VideoMME, MLVU, LongVideoBench, and LVBench. Our code is available at https://github.com/SCZwangxiao/video-ReTaKe.
arXiv Detail & Related papers (2024-12-29T15:42:24Z)
- Large Language Models for Lossless Image Compression: Next-Pixel Prediction in Language Space is All You Need [53.584140947828004]
A large language model (LLM) with unprecedented intelligence is a general-purpose lossless compressor for various data modalities.
We propose P$^2$-LLM, a next-pixel prediction-based LLM, which integrates various elaborated insights and methodologies.
Experiments on benchmark datasets demonstrate that P$2$-LLM can beat SOTA classical and learned codecs.
arXiv Detail & Related papers (2024-11-19T12:15:40Z)
- Perception Compressor: A Training-Free Prompt Compression Framework in Long Context Scenarios [17.720102137585503]
Perception Compressor is a training-free prompt compression framework for large language models.
It includes a perception retriever that leverages guiding questions and instruction to retrieve the most relevant demonstrations.
We conduct extensive experiments on long-context benchmarks: iSie, LongBench, and MuSiQue.
arXiv Detail & Related papers (2024-09-28T07:13:33Z)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
- Efficient Large Multi-modal Models via Visual Context Compression [23.966237939194514]
We present a study of visual-token redundancy and efficient training in multi-modal large language models.
Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simple average pooling leads to only a minimal 3% reduction in visual question answering accuracy on the GQA benchmark.
We introduce Visual Context Compressor, which reduces the number of visual tokens to enhance training and inference efficiency without sacrificing performance.
arXiv Detail & Related papers (2024-06-28T17:57:14Z)
- VoCo-LLaMA: Towards Vision Compression with Large Language Models [31.398537194299752]
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. Our method achieves minimal performance loss with a compression ratio of 576$\times$, resulting in up to 94.8% fewer FLOPs and 69.6% acceleration in inference time.
arXiv Detail & Related papers (2024-06-18T05:05:12Z)
- Long Context Compression with Activation Beacon [22.054232261437186]
Activation Beacon is a plug-in module for transformer-based LLMs.
It targets effective, efficient, and flexible compression of long contexts.
It achieves a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache.
arXiv Detail & Related papers (2024-01-07T11:57:40Z)
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities.
This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity.
We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z)
- Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
Knowledge-Intensive Compressed LLM BenchmarK aims to redefine the evaluation protocol for compressed Large Language Models.
LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods.
LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z)
- Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.