MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
- URL: http://arxiv.org/abs/2307.16449v4
- Date: Sat, 9 Mar 2024 06:43:37 GMT
- Title: MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
- Authors: Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou,
Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng
Hwang, Gaoang Wang
- Abstract summary: MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos and 14K manual annotations.
- Score: 38.504994472886786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, video understanding systems built by integrating video
foundation models and large language models have been shown to overcome the
limitations of specific pre-defined vision tasks. Yet, existing systems can
only handle videos with very few frames. For long videos, computational
complexity, memory cost, and long-range temporal connections impose additional
challenges. Drawing on the Atkinson-Shiffrin memory model, with tokens in
Transformers employed as the carriers of memory in combination with a
specially designed memory mechanism, we propose MovieChat to overcome these
challenges. MovieChat achieves state-of-the-art performance in long video
understanding, along with the released MovieChat-1K benchmark with 1K long
videos and 14K manual annotations for validating the effectiveness of our
method.
Related papers
- $\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation [19.616624959353697]
$\infty$-Video can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism.
Our framework augments video Q-formers by allowing them to process video contexts efficiently and without requiring additional training.
arXiv Detail & Related papers (2025-01-31T12:45:46Z)
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [43.485687038460895]
This paper introduces a Hierarchical visual token Compression (HiCo) method designed for high-fidelity representation.
HiCo capitalizes on the redundancy of visual information in long videos to compress long video context from the clip-level to the video-level.
VideoChat-Flash shows the leading performance on both mainstream long and short video benchmarks at the 2B and 7B model scale.
arXiv Detail & Related papers (2024-12-31T18:01:23Z)
- Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark [5.76230561819199]
We propose Repurpose-10K, an extensive dataset comprising over 10,000 videos with more than 120,000 annotated clips.
We propose a two-stage solution to obtain annotations from real-world user-generated content.
We offer a baseline model to address this challenging task by integrating audio, visual, and caption aspects.
arXiv Detail & Related papers (2024-12-12T02:27:46Z)
- Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework.
Our approach significantly reduces the memory footprint compared to standard gradient checkpointing.
By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
- Hierarchical Memory for Long Video QA [78.72965584414368]
This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA).
We adopt a hierarchical memory mechanism named STAR Memory that is capable of processing long videos with limited GPU memory (VRAM).
We further utilize the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge.
arXiv Detail & Related papers (2024-06-30T06:08:12Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It understands videos of arbitrary length using a constant number of video streaming tokens that are encoded and selectively propagated.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
- MovieChat+: Question-aware Sparse Memory for Long Video Question Answering [36.14140811797466]
We propose MovieChat to overcome the challenges of understanding long videos.
We use tokens in Transformers as the carriers of memory in combination with our specially designed memory mechanism.
MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos, 2K temporal grounding labels, and 14K manual annotations for validating the effectiveness of our method.
arXiv Detail & Related papers (2024-04-26T06:17:04Z)
- Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)