Related papers: Visual Context Window Extension: A New Perspective for Long Video Understanding

URL: http://arxiv.org/abs/2409.20018v2
Date: Wed, 2 Oct 2024 09:34:11 GMT
Title: Visual Context Window Extension: A New Perspective for Long Video Understanding
Authors: Hongchen Wei, Zhenzhong Chen
Abstract summary: We tackle the challenge of long video understanding from the perspective of context windows. We propose to adapt LMMs for long video understanding tasks by extending the visual context window. Our method consistently improves the performance as the number of video frames increases.
Score: 45.134271969594614
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Multimodal Models (LMMs) have demonstrated impressive performance in short video understanding tasks but face great challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. However, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between visual and language modalities lead to different context windows for visual and language tokens, making it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves the performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
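
Illustrative sketch (not from the paper): the abstract says only that progressive pooling selectively adjusts the spatial resolution of frame embeddings to cut the visual token count. The Python below is one possible reading under stated assumptions, not the authors' implementation. It assumes that the first frame of each fixed-size group is average-pooled lightly and the remaining frames more aggressively. The names avg_pool, progressive_pool, count_tokens, group_size, light_stride and heavy_stride are all hypothetical, as are the grid sizes.

```python
from typing import List

# One frame embedding: an H x W grid of D-dimensional token vectors.
Grid = List[List[List[float]]]


def avg_pool(grid: Grid, stride: int) -> Grid:
    """Average-pool an H x W x D grid with a square, non-overlapping window."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, h - stride + 1, stride):
        row = []
        for j in range(0, w - stride + 1, stride):
            cells = [grid[i + di][j + dj]
                     for di in range(stride) for dj in range(stride)]
            row.append([sum(c[k] for c in cells) / len(cells) for k in range(d)])
        pooled.append(row)
    return pooled


def progressive_pool(frames: List[Grid], group_size: int = 4,
                     light_stride: int = 2, heavy_stride: int = 4) -> List[Grid]:
    """ASSUMPTION: the first frame of each group keeps more spatial detail
    (light_stride); all other frames are pooled harder (heavy_stride)."""
    out = []
    for idx, frame in enumerate(frames):
        stride = light_stride if idx % group_size == 0 else heavy_stride
        out.append(avg_pool(frame, stride))
    return out


def count_tokens(frames: List[Grid]) -> int:
    """Total number of visual tokens across all frames."""
    return sum(len(f) * len(f[0]) for f in frames)


# Toy input: 8 frames, each an 8 x 8 grid of 2-dimensional embeddings.
frames = [[[[1.0, 2.0] for _ in range(8)] for _ in range(8)] for _ in range(8)]

uniform = [avg_pool(f, 2) for f in frames]
progressive = progressive_pool(frames)

print(count_tokens(uniform))      # 128 tokens with uniform stride-2 pooling
print(count_tokens(progressive))  # 56 tokens with the assumed progressive scheme
```

In this toy setting the assumed scheme keeps 56 of the 128 tokens that uniform pooling would produce. These counts describe only the sketch and say nothing about the memory savings reported in the abstract.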

Related papers

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning [66.24870234484668]
We introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks.
arXiv Detail & Related papers (2025-12-02T05:14:52Z)
Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding [56.45689495743107]
Vgent is a graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. We evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks.
arXiv Detail & Related papers (2025-10-15T19:14:58Z)
Video Panels for Long Video Understanding [25.560912635941662]
We propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing Video-Language Models.
arXiv Detail & Related papers (2025-09-28T08:05:55Z)
ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction [10.579335027350263]
AdaCM$^2$ is an adaptive cross-modality memory reduction approach to video-text alignment on video streams. It achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
arXiv Detail & Related papers (2024-11-19T18:04:13Z)
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. We leverage DINOv2 features to remove redundant frames that exhibit high similarity. We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding [52.696422425058245]
MultiModal Large Language Models (LLMs) with visual encoders have recently shown promising performance in visual understanding tasks. Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding.
arXiv Detail & Related papers (2024-09-27T17:38:36Z)
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding [26.72068455284472]
Video-XL is an extra-long vision language model designed for efficient hour-scale video understanding. Our model achieves promising results on popular long video understanding benchmarks.
arXiv Detail & Related papers (2024-09-22T15:13:31Z)
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. In this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z)
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding [66.56100008577134]
This study focuses on designing an efficient and effective model for long-term video understanding. We propose to process videos in an online manner and store past video information in a memory bank. Our model can achieve state-of-the-art performances across multiple datasets.
arXiv Detail & Related papers (2024-04-08T17:59:24Z)
LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding. We encode video representations that incorporate both local and global information. Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
Long Video Understanding with Learnable Retrieval in Video-Language Models [36.793956806567834]
We introduce a learnable retrieval-based video-language model (R-VLM) for efficient long video understanding. Specifically, given a question (Query) and a long video, our model identifies and selects the most relevant K video chunks. This effectively reduces the number of video tokens, eliminates noise interference, and enhances system performance.
arXiv Detail & Related papers (2023-12-08T09:48:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.