MovieChat: From Dense Token to Sparse Memory for Long Video
Understanding
- URL: http://arxiv.org/abs/2307.16449v4
- Date: Sat, 9 Mar 2024 06:43:37 GMT
- Title: MovieChat: From Dense Token to Sparse Memory for Long Video
Understanding
- Authors: Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou,
Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng
Hwang, Gaoang Wang
- Abstract summary: MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos and 14K manual annotations.
- Score: 38.504994472886786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, integrating video foundation models and large language models to
build video understanding systems has made it possible to overcome the limitations of
specific pre-defined vision tasks. Yet, existing systems can only handle videos with
very few frames. For long videos, the computation complexity, memory cost, and
long-term temporal connections impose additional challenges. Taking advantage of
the Atkinson-Shiffrin memory model, with tokens in Transformers employed as the
carriers of memory in combination with our specially designed memory mechanism, we
propose MovieChat to overcome these challenges. MovieChat achieves state-of-the-art
performance in long video understanding, along with the released MovieChat-1K
benchmark with 1K long videos and 14K manual annotations for validating the
effectiveness of our method.
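To make the memory design in the abstract concrete, below is a minimal Python sketch of a short-term/long-term token memory in the spirit of the Atkinson-Shiffrin model: incoming frame tokens fill a small FIFO buffer, and overflow is consolidated into a compact long-term store by merging the most similar adjacent tokens. Class and parameter names (SparseVideoMemory, short_capacity, long_capacity) and sizes are illustrative assumptions, not MovieChat's actual code.

```python
import torch
import torch.nn.functional as F


class SparseVideoMemory:
    """Illustrative short-term/long-term memory over frame tokens.

    Incoming frame tokens enter a fixed-size short-term buffer; when the
    buffer overflows, the oldest token is consolidated into long-term memory,
    which is kept sparse by merging the most similar adjacent pair.
    A sketch of the idea only, not MovieChat's actual implementation.
    """

    def __init__(self, short_capacity: int = 16, long_capacity: int = 256):
        self.short_capacity = short_capacity
        self.long_capacity = long_capacity
        self.short_term = []   # list of (D,) token tensors, FIFO order
        self.long_term = []    # consolidated, sparse tokens

    def add_frame_token(self, token: torch.Tensor) -> None:
        self.short_term.append(token)
        if len(self.short_term) > self.short_capacity:
            self._consolidate()

    def _consolidate(self) -> None:
        # Move the oldest short-term token into long-term memory.
        self.long_term.append(self.short_term.pop(0))
        # Keep long-term memory sparse: merge the most similar adjacent pair.
        while len(self.long_term) > self.long_capacity:
            sims = [F.cosine_similarity(a, b, dim=0)
                    for a, b in zip(self.long_term[:-1], self.long_term[1:])]
            i = int(torch.stack(sims).argmax())
            merged = (self.long_term[i] + self.long_term[i + 1]) / 2
            self.long_term[i:i + 2] = [merged]

    def context(self) -> torch.Tensor:
        # Tokens handed to the language model: compact long-term memory
        # followed by the most recent frames.
        return torch.stack(self.long_term + self.short_term)
```

Usage would simply call add_frame_token for every encoded frame and pass context() to the language model, so the number of tokens stays bounded no matter how long the video is.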
Related papers
- ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information.
We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity.
We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
- Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input [34.50993235961505]
Kangaroo is a powerful Video LMM aimed at addressing the challenges of processing long videos.
A data curation system builds a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning.
A curriculum training pipeline gradually increases the resolution and number of input frames to accommodate long videos.
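The curriculum idea can be pictured as a staged schedule that grows spatial resolution and frame count over training. The stage values and step counts below are invented for illustration and are not Kangaroo's actual configuration.

```python
# Hypothetical curriculum: resolution and frame count grow over training.
CURRICULUM = [
    {"resolution": 224, "num_frames": 8},
    {"resolution": 336, "num_frames": 32},
    {"resolution": 448, "num_frames": 64},   # values are illustrative only
]


def stage_for_step(step: int, steps_per_stage: int = 10_000) -> dict:
    """Pick the training stage (resolution, frame count) for a given step."""
    idx = min(step // steps_per_stage, len(CURRICULUM) - 1)
    return CURRICULUM[idx]
```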
arXiv Detail & Related papers (2024-08-28T05:34:14Z)
- Hierarchical Memory for Long Video QA [78.72965584414368]
This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA).
We adopt a hierarchical memory mechanism named STAR Memory, which is capable of processing long videos with limited GPU memory (VRAM); see the sketch below.
We further utilize the video and audio data of MovieChat-1K training set to fine-tune the pretrained weight released by Flash-VStream, achieving 1st place in the challenge.
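As a rough picture of running a long-video memory under a VRAM budget, the sketch below keeps recent features on the GPU and offloads older ones to CPU RAM, reloading them only when the full context is queried. This is a generic two-tier illustration under assumed names and sizes, not the actual STAR Memory design.

```python
import torch


class TieredMemory:
    """Generic two-tier memory for long videos under a VRAM budget.

    Recent feature chunks stay on the GPU; once the token budget is exceeded,
    the oldest chunks are offloaded to CPU RAM and brought back on demand.
    Illustrative only; not the STAR Memory implementation.
    """

    def __init__(self, gpu_budget_tokens: int = 2048,
                 device: str = "cuda" if torch.cuda.is_available() else "cpu"):
        self.gpu_budget = gpu_budget_tokens
        self.device = device
        self.gpu_tier = []   # list of (T_i, D) tensors kept on the GPU
        self.cpu_tier = []   # older tensors offloaded to CPU RAM

    def write(self, feats: torch.Tensor) -> None:
        self.gpu_tier.append(feats.to(self.device))
        # Offload the oldest chunks once the GPU token budget is exceeded.
        while sum(t.size(0) for t in self.gpu_tier) > self.gpu_budget:
            self.cpu_tier.append(self.gpu_tier.pop(0).to("cpu"))

    def read_all(self) -> torch.Tensor:
        # Reload the cold tier when the full temporal context is needed.
        cold = [t.to(self.device) for t in self.cpu_tier]
        return torch.cat(cold + self.gpu_tier, dim=0)
```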
arXiv Detail & Related papers (2024-06-30T06:08:12Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It understands arbitrary-length videos with a constant number of video streaming tokens that are encoded and selected through propagation.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
- MovieChat+: Question-aware Sparse Memory for Long Video Question Answering [36.14140811797466]
We propose MovieChat to overcome the challenges of understanding long videos.
We use tokens in Transformers as the carriers of memory in combination with our specially designed memory mechanism.
MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos, 2K temporal grounding labels, and 14K manual annotations for validating the effectiveness of our method.
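The "question-aware" part can be illustrated with a simple relevance filter: score visual tokens against the question embedding and keep only the top-scoring ones in temporal order. The function below is a toy stand-in under assumed names and a top-k rule, not the paper's actual selection procedure.

```python
import torch
import torch.nn.functional as F


def question_aware_select(frame_tokens: torch.Tensor,
                          question_embedding: torch.Tensor,
                          keep: int = 64) -> torch.Tensor:
    """Keep the frame tokens most relevant to the question.

    frame_tokens: (T, D) visual tokens; question_embedding: (D,).
    Scores every token against the question and retains the top-`keep`
    tokens, preserving their temporal order.
    """
    scores = F.cosine_similarity(frame_tokens,
                                 question_embedding.unsqueeze(0), dim=-1)
    topk = scores.topk(min(keep, frame_tokens.size(0))).indices.sort().values
    return frame_tokens[topk]
```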
arXiv Detail & Related papers (2024-04-26T06:17:04Z)
- Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
- LVCHAT: Facilitating Long Video Comprehension [25.395689904747965]
We propose Long Video Chat (LVChat) to enable multimodal large language models (LLMs) to comprehend long videos.
LVChat significantly outperforms existing methods by up to 27% in accuracy on long-video QA datasets and long-video captioning benchmarks.
arXiv Detail & Related papers (2024-02-19T11:59:14Z)
- MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition [74.35009770905968]
We build a memory-augmented vision transformer whose temporal support is 30x longer than that of existing models.
MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets.
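The memory-augmentation idea can be sketched as attention over features cached from earlier clips, which extends temporal support without reprocessing old frames. The module below is an illustrative simplification (dimensions, caching policy, and names are assumed), not MeMViT's implementation.

```python
import torch


class MemAugmentedAttention(torch.nn.Module):
    """Toy cross-clip attention with a cached memory of past clips.

    Features from previous clips are cached (detached from the graph) and
    prepended to the current clip's keys/values, so attention sees a much
    longer temporal context at roughly constant per-clip cost.
    """

    def __init__(self, dim: int = 256, heads: int = 4, max_mem: int = 4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.max_mem = max_mem
        self.cache = []  # list of (B, T, D) detached tensors from past clips

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attend from the current clip to cached memory plus itself.
        mem = torch.cat(self.cache + [x], dim=1) if self.cache else x
        out, _ = self.attn(x, mem, mem)
        # Cache the current clip's features for future clips (bounded size).
        self.cache.append(x.detach())
        self.cache = self.cache[-self.max_mem:]
        return out
```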
arXiv Detail & Related papers (2022-01-20T18:59:54Z)