MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient
Long-Term Video Recognition
- URL: http://arxiv.org/abs/2201.08383v1
- Date: Thu, 20 Jan 2022 18:59:54 GMT
- Title: MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient
Long-Term Video Recognition
- Authors: Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong,
Jitendra Malik, Christoph Feichtenhofer
- Abstract summary: We build a memory-augmented vision transformer that has a temporal support 30x longer than existing models.
MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets.
- Score: 74.35009770905968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While today's video recognition systems parse snapshots or short clips
accurately, they cannot connect the dots and reason across a longer range of
time yet. Most existing video architectures can only process <5 seconds of a
video without hitting the computation or memory bottlenecks.
In this paper, we propose a new strategy to overcome this challenge. Instead
of trying to process more frames at once like most existing methods, we propose
to process videos in an online fashion and cache "memory" at each iteration.
Through the memory, the model can reference prior context for long-term
modeling, with only a marginal cost. Based on this idea, we build MeMViT, a
Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x
longer than existing models with only 4.5% more compute; traditional methods
need >3,000% more compute to do the same. On a wide range of settings, the
increased temporal support enabled by MeMViT brings large gains in recognition
accuracy consistently. MeMViT obtains state-of-the-art results on the AVA,
EPIC-Kitchens-100 action classification, and action anticipation datasets. Code
and models will be made publicly available.
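The core idea in the abstract (process a long video clip by clip, cache a "memory" of each clip, and let later clips attend to that cached context at marginal cost) can be illustrated with a minimal PyTorch sketch. This is a hypothetical, simplified single-head version: the class and parameter names (CachedMemoryAttention, max_cached_clips) are not from the released code, and the actual MeMViT additionally compresses the cached memory before attending over it, which is omitted here.

```python
import torch
import torch.nn as nn

class CachedMemoryAttention(nn.Module):
    """Single-head attention over the current clip's tokens plus keys/values
    cached from previously processed clips (illustrative sketch only)."""

    def __init__(self, dim: int, max_cached_clips: int = 2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.max_cached_clips = max_cached_clips
        self.k_cache, self.v_cache = [], []        # one cached tensor per past clip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) for the current short clip.
        q, k_new, v_new = self.qkv(x).chunk(3, dim=-1)

        # Queries come only from the current clip, but keys/values are
        # extended with the cached "memory" of earlier clips.
        k = torch.cat(self.k_cache + [k_new], dim=1)
        v = torch.cat(self.v_cache + [v_new], dim=1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v

        # Cache this clip's keys/values, detached so no gradients flow back
        # through time; only a bounded number of past clips is kept.
        self.k_cache.append(k_new.detach())
        self.v_cache.append(v_new.detach())
        if len(self.k_cache) > self.max_cached_clips:
            self.k_cache.pop(0)
            self.v_cache.pop(0)
        return self.proj(out)


# Usage: stream a long video as consecutive short clips through one module.
attn = CachedMemoryAttention(dim=96, max_cached_clips=2)
clips = [torch.randn(1, 196, 96) for _ in range(8)]    # 8 consecutive clips
outputs = [attn(clip) for clip in clips]               # each step sees cached context
```

Detaching the cached keys and values is what keeps the extra cost marginal in this sketch: later clips attend over a longer temporal context, but no gradients are backpropagated through past clips.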
Related papers
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges [42.555895949250704]
VideoLLaMB is a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences.
The SceneTilling algorithm segments videos into independent semantic units to preserve semantic integrity.
In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding [66.56100008577134]
This study focuses on designing an efficient and effective model for long-term video understanding.
We propose to process videos in an online manner and store past video information in a memory bank.
Our model can achieve state-of-the-art performances across multiple datasets.
arXiv Detail & Related papers (2024-04-08T17:59:24Z)
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
Our scaled spatiotemporal transformer architecture instead allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Memory Efficient Temporal & Visual Graph Model for Unsupervised Video Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
arXiv Detail & Related papers (2022-08-13T02:56:10Z)
- TALLFormer: Temporal Action Localization with Long-memory Transformer [16.208160001820044]
TALLFormer is a memory-efficient and end-to-end trainable temporal action localization transformer.
Our long-term memory mechanism eliminates the need for processing hundreds of redundant video frames during each training iteration; a rough sketch of this idea appears after this list.
With only RGB frames as input, TALLFormer outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-04-04T17:51:20Z)
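For the TALLFormer entry above, the claim that a long-term memory removes the need to re-encode hundreds of frames per training iteration can be illustrated with a rough, hypothetical sketch: cache per-frame features across iterations and run the heavy encoder only on a short window each step. LongTermFeatureMemory and its parameters are invented names for illustration; this is not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LongTermFeatureMemory:
    """Toy long-term frame-feature memory: each iteration re-encodes only a
    short window of frames and reuses cached features for the rest of the
    (long) video. Illustrative only, not the TALLFormer implementation."""

    def __init__(self, encoder: nn.Module, num_frames: int, dim: int):
        self.encoder = encoder
        self.memory = torch.zeros(num_frames, dim)    # cached per-frame features

    def step(self, frames: torch.Tensor, window: slice) -> torch.Tensor:
        fresh = self.encoder(frames[window])          # only these frames hit the encoder
        self.memory[window] = fresh.detach()          # refresh the cache without gradients
        full = self.memory.clone()                    # cached features for all other frames
        full[window] = fresh                          # the window keeps its gradient path
        return full                                   # (num_frames, dim) for a downstream head


# Usage with a stand-in per-frame encoder; a real model would use a video backbone.
encoder = nn.Linear(512, 256)                         # toy frame encoder
memory = LongTermFeatureMemory(encoder, num_frames=400, dim=256)
video = torch.randn(400, 512)                         # 400 precomputed frame inputs
for start in range(0, 400, 32):                       # one short window per iteration
    features = memory.step(video, slice(start, start + 32))
```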