When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
- URL: http://arxiv.org/abs/2507.20198v3
- Date: Wed, 30 Jul 2025 01:07:17 GMT
- Title: When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
- Authors: Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
- Abstract summary: Multimodal large language models (MLLMs) process increasingly long and complex contexts. Token compression has emerged as a promising and critical approach, efficiently reducing the number of tokens during both training and inference. We present the first systematic survey and synthesis of the burgeoning field of multimodal long-context token compression.
- Score: 27.220318661244242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as a promising and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long-context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
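To ground the mechanism taxonomy from the abstract, here is a minimal PyTorch sketch of the similarity-based family: redundant tokens are greedily merged with their most similar neighbors until a target budget is met. The function name and the merge-by-averaging policy are illustrative assumptions, not the procedure of any specific surveyed paper; practical methods (e.g., ToMe-style bipartite soft matching) vectorize this greedy loop rather than iterating.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Similarity-based token compression (illustrative sketch only).

    tokens: (N, D) sequence of modality tokens from a vision/audio encoder.
    keep:   target number of tokens after compression (keep < N).
    """
    tokens = tokens.clone()
    while tokens.shape[0] > keep:
        # Cosine similarity between all pairs of remaining tokens.
        normed = F.normalize(tokens, dim=-1)
        sim = normed @ normed.T
        sim.fill_diagonal_(-float("inf"))  # ignore self-similarity
        # Merge the most redundant pair by averaging.
        idx = torch.argmax(sim)
        i, j = divmod(idx.item(), tokens.shape[0])
        merged = (tokens[i] + tokens[j]) / 2
        mask = torch.ones(tokens.shape[0], dtype=torch.bool)
        mask[j] = False
        tokens = tokens[mask]
        tokens[i if i < j else i - 1] = merged  # i shifts down if j < i
    return tokens

# Example: compress 64 patch tokens to a 16-token budget before the LLM.
compressed = merge_similar_tokens(torch.randn(64, 768), keep=16)
```

Attention-based and query-based variants in the survey's taxonomy replace this cosine-similarity criterion with attention scores or learned query tokens, respectively.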
Related papers
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos.
At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
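The summary names FlexFormer only at a high level; as a rough, hypothetical sketch of the general idea (learned latent queries jointly compressing video and text tokens, in the spirit of Perceiver-style resamplers, and not LoViC's released code):

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Assumed sketch: learned latent queries cross-attend to concatenated
    video and text tokens, yielding a fixed-size joint representation."""
    def __init__(self, dim: int = 512, num_latents: int = 64, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # (B, Nv, D) and (B, Nt, D) -> (B, num_latents, D)
        ctx = torch.cat([video_tokens, text_tokens], dim=1)
        q = self.latents.unsqueeze(0).expand(ctx.size(0), -1, -1)
        out, _ = self.attn(q, ctx, ctx)
        return out

z = LatentCompressor()(torch.randn(2, 1024, 512), torch.randn(2, 77, 512))
```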
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
- DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression [63.83422894663496]
We propose a dynamic attention-aware approach for task-agnostic prompt compression (DAC).
This approach effectively integrates entropy and attention information, dynamically sensing entropy shifts during compression to achieve fine-grained prompt compression.
Extensive experiments across various domains, including LongBench, GSM8K, and BBH, show that DAC consistently yields robust and substantial improvements.
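DAC's exact algorithm is not reproduced here; the following hypothetical sketch only illustrates how per-token entropy and attention signals could be fused into a single pruning score, as the summary describes:

```python
import torch

def prune_prompt_tokens(logits: torch.Tensor, attn: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Hypothetical entropy- and attention-aware token pruning.

    logits: (T, V) per-position logits from a small scoring LM.
    attn:   (T,)  attention mass each prompt token receives.
    Returns indices of tokens to keep, in original order.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # (T,)
    # Informative (high-entropy) and highly attended tokens are kept.
    score = entropy / entropy.sum() + attn / attn.sum()
    k = max(1, int(keep_ratio * len(score)))
    return score.topk(k).indices.sort().values

keep_idx = prune_prompt_tokens(torch.randn(128, 32000), torch.rand(128))
```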
arXiv Detail & Related papers (2025-07-16T06:16:06Z)
- Shifting AI Efficiency From Model-Centric to Data-Centric Compression [33.41504505470217]
We argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression.
We position token compression as the new frontier, improving AI efficiency by reducing the number of tokens during model training or inference.
arXiv Detail & Related papers (2025-05-25T13:51:17Z)
- Enhancing Long Video Generation Consistency without Tuning [92.1714656167712]
We enhance the consistency and coherence of videos generated with either single or multiple prompts.
We propose the Time-frequency based temporal Attention Reweighting Algorithm (TiARA), which judiciously edits the attention score matrix.
For videos generated by multiple prompts, we further uncover key factors, such as the alignment of the prompts, that affect quality.
Inspired by our analyses, we propose PromptBlend, an advanced prompt pipeline that systematically aligns the prompts.
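TiARA's precise edit of the attention score matrix is not spelled out in the summary; as a heavily simplified, assumed caricature of time-frequency reweighting, one can low-pass filter temporal attention rows in the frequency domain to suppress frame-to-frame jitter:

```python
import torch

def lowpass_temporal_attention(attn: torch.Tensor, cutoff: int) -> torch.Tensor:
    """Assumed sketch: smooth temporal attention via frequency filtering.

    attn:   (F, F) attention scores across F video frames (rows sum to 1).
    cutoff: number of low-frequency components kept per row.
    """
    spec = torch.fft.rfft(attn, dim=-1)            # per-row frequency content
    spec[..., cutoff:] = 0                         # drop high-frequency jitter
    smoothed = torch.fft.irfft(spec, n=attn.shape[-1], dim=-1)
    smoothed = smoothed.clamp_min(0)
    return smoothed / smoothed.sum(-1, keepdim=True)  # re-normalize rows

attn = torch.softmax(torch.randn(64, 64), dim=-1)
smoothed = lowpass_temporal_attention(attn, cutoff=8)
```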
arXiv Detail & Related papers (2024-12-23T03:56:27Z)
- Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach [56.610806615527885]
A key challenge in text-video retrieval (TVR) is the information asymmetry between video and text.
This paper introduces a data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content.
We propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy.
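The summary leaves the query selection mechanism unspecified; a plausible stand-in, offered only as an assumption, is a maximal-marginal-relevance-style greedy pick that trades relevance to the video against redundancy among already chosen queries:

```python
import torch
import torch.nn.functional as F

def select_queries(q_emb: torch.Tensor, v_emb: torch.Tensor,
                   k: int, lam: float = 0.5) -> list[int]:
    """Hypothetical relevance-plus-diversity query selection (MMR-style).

    q_emb: (Q, D) candidate query embeddings; v_emb: (D,) video embedding.
    Assumes k <= Q. Returns indices of the k selected queries.
    """
    q = F.normalize(q_emb, dim=-1)
    v = F.normalize(v_emb, dim=-1)
    relevance = q @ v                                   # (Q,)
    chosen = [int(relevance.argmax())]
    while len(chosen) < k:
        # Penalize closeness to any already-selected query.
        redundancy = (q @ q[chosen].T).max(dim=-1).values
        score = lam * relevance - (1 - lam) * redundancy
        score[chosen] = -float("inf")
        chosen.append(int(score.argmax()))
    return chosen

picked = select_queries(torch.randn(32, 512), torch.randn(512), k=5)
```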
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Compressible and Searchable: AI-native Multi-Modal Retrieval System with Learned Image Compression [0.6345523830122168]
Conventional approaches struggle to cope with the escalating complexity and scale of multimedia data.
Our proposed framework addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression.
Our work marks a significant advancement towards scalable and efficient multi-modal search systems in the era of big data.
arXiv Detail & Related papers (2024-04-16T02:29:00Z)
- Deep Common Feature Mining for Efficient Video Semantic Segmentation [25.851900402539467]
We present Deep Common Feature Mining (DCFM) for video semantic segmentation.
DCFM explicitly decomposes features into two complementary components.
We incorporate a self-supervised loss function to reinforce intra-class feature similarity and enhance temporal consistency.
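As an assumed illustration of the self-supervised objective described above (not DCFM's actual loss), a cosine-based penalty can tie the shared feature component of adjacent frames together to encourage temporal consistency:

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(feat_t: torch.Tensor,
                              feat_prev: torch.Tensor) -> torch.Tensor:
    """Hypothetical loss keeping the 'common' feature component stable
    across neighboring frames.

    feat_t, feat_prev: (C, H, W) common-feature maps of adjacent frames.
    """
    a = F.normalize(feat_t.flatten(1), dim=0)    # (C, H*W), unit per location
    b = F.normalize(feat_prev.flatten(1), dim=0)
    return (1 - (a * b).sum(0)).mean()           # 1 - cosine sim, averaged

loss = temporal_consistency_loss(torch.randn(256, 32, 32),
                                 torch.randn(256, 32, 32))
```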
arXiv Detail & Related papers (2024-03-05T06:17:59Z)
- AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism [64.70568612993416]
We formulate a new task, Livestream Highlight Detection, analyze its difficulties, and propose a novel architecture, AntPivot, to solve it.
We construct a fully annotated dataset, AntHighlight, to instantiate this task and evaluate our model's performance.
arXiv Detail & Related papers (2022-06-10T05:58:11Z)
- Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead [88.17413955380262]
We introduce a novel architecture for early exiting based on the vision transformer.
We show that our method works for both classification and regression problems.
We also introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis.
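For readers unfamiliar with early exiting, a generic confidence-thresholded exit looks like the sketch below; the wrapper, names, and pooling are our assumptions and not the paper's single-layer design:

```python
import torch
import torch.nn as nn

class EarlyExitViT(nn.Module):
    """Illustrative early-exit wrapper around a stack of transformer blocks.

    A lightweight classifier after an early block lets inference stop
    when its confidence clears a threshold, skipping remaining blocks.
    """
    def __init__(self, blocks: nn.ModuleList, early_head: nn.Module,
                 final_head: nn.Module, exit_at: int, threshold: float = 0.9):
        super().__init__()
        self.blocks, self.early_head, self.final_head = blocks, early_head, final_head
        self.exit_at, self.threshold = exit_at, threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) token sequence; batch of one assumed for the exit check.
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == self.exit_at:
                logits = self.early_head(x.mean(dim=1))  # mean-pooled tokens
                if logits.softmax(-1).max() >= self.threshold:
                    return logits  # confident enough: exit early
        return self.final_head(x.mean(dim=1))
```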
arXiv Detail & Related papers (2021-05-19T13:30:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.