Related papers: DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs

DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs

URL: http://arxiv.org/abs/2507.10302v1
Date: Mon, 14 Jul 2025 14:05:19 GMT
Title: DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs
Authors: Jiahe Zhao, Rongkun Zheng, Yi Wang, Helin Wang, Hengshuang Zhao,
Abstract summary: DisCo is a visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs.<n>DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks.
Score: 28.998923104606614
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics for visual tokens by associating them in pair with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness. The code: https://github.com/ZJHTerry18/DisCo.

Related papers

VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration [31.27071437510817]
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens.<n>We propose VisionTrim, a unified framework for training-free MLLM acceleration.
arXiv Detail & Related papers (2026-01-30T07:45:48Z)
VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs [28.026438743789907]
VideoScaffold is a dynamic representation framework designed for streaming video understanding.<n>It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics.<n>The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension.
arXiv Detail & Related papers (2025-12-23T03:33:45Z)
Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders [62.58375366359421]
Multimodal Large Language Models (MLLMs) for long video understanding remains a challenging problem.<n>Traditional uniform sampling leads to selection of irrelevant content.<n>Post-training MLLMs on thousands of frames imposes a substantial computational burden.<n>We propose threadings with narratives (Nar-KFC) to facilitate effective and efficient long video perception.
arXiv Detail & Related papers (2025-05-30T03:04:28Z)
Multimodal Long Video Modeling Based on Temporal Dynamic Context [13.979661295432964]
We propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC)<n>We segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders.<n>To handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments.
arXiv Detail & Related papers (2025-04-14T17:34:06Z)
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension [86.0749609778104]
We propose QuoTA, an ante-hoc training-free modular that extends existing large video-language models.<n>QuoTA strategically allocates frame-level importance scores based on query relevance.<n>We decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring.
arXiv Detail & Related papers (2025-03-11T17:59:57Z)
Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.<n>We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.<n>This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)<n>SeTok groups visual features into semantic units via a dynamic clustering algorithm.<n>The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video. In this paper, we address such limitations in video pre-training with an efficient video decomposition. Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content modeling its internal connections. We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains. We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.