Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
- URL: http://arxiv.org/abs/2409.14485v4
- Date: Tue, 10 Dec 2024 12:45:31 GMT
- Title: Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
- Authors: Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, Bo Zhao
- Abstract summary: Video-XL is a novel approach that leverages MLLMs' inherent key-value sparsification capacity to condense the visual input.
Video-XL's effectiveness is verified from three aspects. First, it achieves a superior long-video understanding capability, outperforming state-of-the-art models of comparable sizes.
- Score: 25.61734041983714
- License:
- Abstract: Long video understanding poses a significant challenge for current Multi-modal Large Language Models (MLLMs). Notably, MLLMs are constrained by their limited context lengths and the substantial cost of processing long videos. Although several existing methods attempt to reduce visual tokens, their strategies encounter a severe bottleneck that restricts MLLMs' ability to perceive fine-grained visual details. In this work, we propose Video-XL, a novel approach that leverages MLLMs' inherent key-value (KV) sparsification capacity to condense the visual input. Specifically, we introduce a new special token, the Visual Summarization Token (VST), for each interval of the video, which summarizes the visual information within the interval as its associated KV. The VST module is trained by instruction fine-tuning with two optimization strategies: 1. curriculum learning, where the VST progressively learns small (easy) and then large (hard) compression; 2. composite data curation, which integrates single-image, multi-image, and synthetic data to overcome the scarcity of long-video instruction data. Compression quality is further improved by dynamic compression, which customizes the compression granularity based on the information density of different video intervals. Video-XL's effectiveness is verified from three aspects. First, it achieves superior long-video understanding capability, outperforming state-of-the-art models of comparable sizes across multiple popular benchmarks. Second, it effectively preserves video information, with minimal compression loss even at a 16x compression ratio. Third, it realizes outstanding cost-effectiveness, enabling high-quality processing of thousands of frames on a single A100 GPU.
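The interval-level VST idea can be pictured with a short sketch. The Python/PyTorch code below is a minimal, illustrative rendering of the mechanism described in the abstract, not the authors' implementation; the function names (`interleave_vst`, `keep_vst_kv`), the toy tensors, and the stand-in KV cache are assumptions made for the example.

```python
# Minimal sketch of interval-level KV condensation with Visual Summarization
# Tokens (VSTs). All names and shapes here are illustrative assumptions.
import torch

def interleave_vst(visual_tokens: torch.Tensor, vst_embed: torch.Tensor,
                   interval: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Append one VST embedding after every `interval` visual tokens.

    visual_tokens: (num_tokens, hidden) frame-patch embeddings in temporal order
    vst_embed:     (hidden,)            learnable special-token embedding
    Returns the interleaved sequence and a boolean mask marking VST positions.
    """
    chunks, is_vst = [], []
    for start in range(0, visual_tokens.size(0), interval):
        chunk = visual_tokens[start:start + interval]
        chunks += [chunk, vst_embed.unsqueeze(0)]
        is_vst += [torch.zeros(chunk.size(0), dtype=torch.bool),
                   torch.ones(1, dtype=torch.bool)]
    return torch.cat(chunks), torch.cat(is_vst)

def keep_vst_kv(keys: torch.Tensor, values: torch.Tensor,
                is_vst: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Discard the KV cache of raw visual tokens; keep only the VST entries.

    keys/values: (seq, heads, head_dim) per-layer cache after prefilling the
    visual segment. The retained entries act as the condensed video memory
    that later text tokens attend to.
    """
    return keys[is_vst], values[is_vst]

# Toy usage: 1,024 visual tokens condensed 16x (one VST per 16-token interval).
hidden, heads, head_dim = 64, 4, 16
tokens = torch.randn(1024, hidden)
vst = torch.randn(hidden)
seq, is_vst = interleave_vst(tokens, vst, interval=16)
k = torch.randn(seq.size(0), heads, head_dim)   # stand-ins for a real KV cache
v = torch.randn(seq.size(0), heads, head_dim)
k_c, v_c = keep_vst_kv(k, v, is_vst)
print(k_c.shape)  # (64, 4, 16): 16x fewer cached entries than the raw visual input
```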
Related papers
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling.
We develop InternVideo2.5, a new version of InternVideo, with a focus on enhancing the original MLLM's ability to perceive fine-grained details in videos.
Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs on mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z) - ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding [55.320254859515714]
We introduce ReTaKe, a training-free method that reduces both temporal visual redundancy and knowledge redundancy for long video understanding.
DPSelect identifies keyframes with local-maximum peak distance based on their visual features, which are closely aligned with human video perception.
PivotKV employs the selected keyframes as pivots and conducts KV-cache compression for the non-pivot tokens with low attention scores.
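As a rough illustration of the DPSelect step summarized above, the sketch below keeps frames at local maxima ("peaks") of inter-frame feature distance. The distance metric, the keep-the-first-frame rule, and the function name are assumptions for illustration, not the ReTaKe implementation.

```python
# Keyframe picking by local peaks of inter-frame feature distance (illustrative).
import numpy as np

def select_peak_frames(features: np.ndarray) -> list[int]:
    """features: (num_frames, dim) per-frame visual features.
    Returns indices of frames where the distance to the previous frame
    is a local maximum, i.e. where the visual content changes most."""
    diffs = np.linalg.norm(np.diff(features, axis=0), axis=1)  # (num_frames-1,)
    keep = [0]  # always keep the first frame (assumption)
    for i in range(1, len(diffs) - 1):
        if diffs[i] > diffs[i - 1] and diffs[i] > diffs[i + 1]:  # local peak
            keep.append(i + 1)
    return keep

frames = np.random.rand(300, 512).astype(np.float32)  # toy per-frame features
print(select_peak_frames(frames)[:10])
```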
arXiv Detail & Related papers (2024-12-29T15:42:24Z) - Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM [28.64108439552772]
We introduce a large-scale synthetic dataset created from proprietary models.
We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance.
Our proposed model achieves state-of-the-art results across various video tasks and shows impressive generalization.
arXiv Detail & Related papers (2024-12-12T18:20:41Z) - AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction [10.579335027350263]
AdaCM$^2$ is an adaptive cross-modality memory reduction approach to video-text alignment on video streams.
It achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
arXiv Detail & Related papers (2024-11-19T18:04:13Z) - TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal contents.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z) - LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
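The frame-pruning idea summarized above can be sketched as a simple similarity filter over per-frame features (e.g. DINOv2 embeddings). The cosine-similarity threshold and the function name below are illustrative assumptions, not the LongVU implementation.

```python
# Drop a frame when its feature is nearly identical to the last kept frame.
import torch
import torch.nn.functional as F

def prune_redundant_frames(feats: torch.Tensor, thresh: float = 0.9) -> list[int]:
    """feats: (num_frames, dim) per-frame features, e.g. from DINOv2.
    Keeps a frame only if it is sufficiently dissimilar to the last kept one."""
    feats = F.normalize(feats, dim=-1)
    kept = [0]
    for i in range(1, feats.size(0)):
        if torch.dot(feats[i], feats[kept[-1]]) < thresh:  # new content appears
            kept.append(i)
    return kept

feats = torch.randn(500, 768)  # toy per-frame features
print(len(prune_redundant_frames(feats)))
```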
arXiv Detail & Related papers (2024-10-22T21:21:37Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
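For readers unfamiliar with the term, "shallow temporal fusion" can be as simple as pooling independently encoded frame features into one video representation. The sketch below shows one such pooling purely as an illustrative assumption, not the paper's specific architecture.

```python
# Mean-pooling frame embeddings from an image encoder: one simple form of
# shallow temporal fusion (illustrative assumption).
import torch

def shallow_temporal_fusion(frame_embeds: torch.Tensor) -> torch.Tensor:
    """frame_embeds: (num_frames, dim) embeddings from a frozen image encoder.
    Returns a single pooled video embedding; no cross-frame attention is learned."""
    return frame_embeds.mean(dim=0)

video_vec = shallow_temporal_fusion(torch.randn(16, 512))
print(video_vec.shape)  # torch.Size([512])
```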
arXiv Detail & Related papers (2023-12-12T16:10:19Z)