MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient
Long-Term Video Recognition
- URL: http://arxiv.org/abs/2201.08383v1
- Date: Thu, 20 Jan 2022 18:59:54 GMT
- Title: MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient
Long-Term Video Recognition
- Authors: Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong,
Jitendra Malik, Christoph Feichtenhofer
- Abstract summary: We build a memory-augmented vision transformer that has a temporal support 30x longer than existing models.
MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets.
- Score: 74.35009770905968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While today's video recognition systems parse snapshots or short clips
accurately, they cannot connect the dots and reason across a longer range of
time yet. Most existing video architectures can only process <5 seconds of a
video without hitting the computation or memory bottlenecks.
In this paper, we propose a new strategy to overcome this challenge. Instead
of trying to process more frames at once like most existing methods, we propose
to process videos in an online fashion and cache "memory" at each iteration.
Through the memory, the model can reference prior context for long-term
modeling, with only a marginal cost. Based on this idea, we build MeMViT, a
Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x
longer than existing models with only 4.5% more compute; traditional methods
need >3,000% more compute to do the same. On a wide range of settings, the
increased temporal support enabled by MeMViT brings large gains in recognition
accuracy consistently. MeMViT obtains state-of-the-art results on the AVA,
EPIC-Kitchens-100 action classification, and action anticipation datasets. Code
and models will be made publicly available.
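To make the caching strategy above concrete, below is a minimal PyTorch sketch (not the released MeMViT code): keys and values computed for earlier clips are detached and kept as read-only "memory" that the current clip's queries attend over, so each online iteration pays only the marginal cost of attending to cached tokens. The class name, single-head attention, and the max_mem cap are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


class MemoryAugmentedAttention(torch.nn.Module):
    """Single-head attention that also attends over cached keys/values
    from previously processed clips (an illustrative sketch, not MeMViT)."""

    def __init__(self, dim: int, max_mem: int = 4):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.max_mem = max_mem            # how many past clips to remember
        self.mem_k, self.mem_v = [], []   # cached (detached) keys / values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) for the *current* clip only.
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Attend over cached memory plus the current clip's keys/values.
        k_all = torch.cat(self.mem_k + [k], dim=1)
        v_all = torch.cat(self.mem_v + [v], dim=1)
        attn = F.softmax((q @ k_all.transpose(-2, -1)) * self.scale, dim=-1)
        out = attn @ v_all

        # Cache this clip's keys/values for later iterations; detaching them
        # keeps the extra cost marginal since no gradients flow into memory.
        self.mem_k = (self.mem_k + [k.detach()])[-self.max_mem:]
        self.mem_v = (self.mem_v + [v.detach()])[-self.max_mem:]
        return self.proj(out)


if __name__ == "__main__":
    layer = MemoryAugmentedAttention(dim=64)
    for _ in range(3):                       # three consecutive clips of one video
        clip_tokens = torch.randn(1, 16, 64)
        print(layer(clip_tokens).shape)      # torch.Size([1, 16, 64])
```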
Related papers
- Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework.
Our approach significantly reduces the memory footprint compared to standard gradient checkpointing.
By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
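As context for the comparison above, here is a minimal sketch of the standard gradient (activation) checkpointing baseline using torch.utils.checkpoint: activations inside each block are recomputed during the backward pass instead of being stored. The tiny TemporalBlock stack is a placeholder assumption; the paper's multi-axis variant is not reproduced here.

```python
import torch
from torch.utils.checkpoint import checkpoint


class TemporalBlock(torch.nn.Module):
    """A placeholder residual block standing in for one layer of a video model."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.LayerNorm(dim),
            torch.nn.Linear(dim, dim),
            torch.nn.GELU(),
            torch.nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)


blocks = torch.nn.ModuleList([TemporalBlock(128) for _ in range(8)])
frames = torch.randn(2, 1024, 128, requires_grad=True)  # (batch, frame tokens, dim)

x = frames
for block in blocks:
    # Only the block inputs are kept; intermediate activations are recomputed
    # on the backward pass, trading compute for memory.
    x = checkpoint(block, x, use_reentrant=False)
x.mean().backward()
```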
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges [42.555895949250704]
VideoLLaMB is a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences.
The SceneTilling algorithm segments videos into independent semantic units to preserve semantic integrity.
In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding [66.56100008577134]
This study focuses on designing an efficient and effective model for long-term video understanding.
We propose to process videos in an online manner and store past video information in a memory bank.
Our model can achieve state-of-the-art performances across multiple datasets.
arXiv Detail & Related papers (2024-04-08T17:59:24Z)
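A toy sketch of the online memory-bank idea in the MA-LMM entry above: each processed clip writes a feature into a bounded bank that later clips can read. The FIFO eviction policy, capacity, and feature shapes are illustrative assumptions, not the paper's design.

```python
from collections import deque

import torch


class MemoryBank:
    """Bounded store of past clip features for online video processing (toy sketch)."""

    def __init__(self, capacity: int = 32):
        self.bank = deque(maxlen=capacity)   # oldest features are evicted first

    def write(self, clip_feature: torch.Tensor) -> None:
        self.bank.append(clip_feature.detach())  # store without gradients

    def read(self) -> torch.Tensor:
        # Stack everything remembered so far: (num_remembered, dim).
        return torch.stack(list(self.bank), dim=0)


bank = MemoryBank(capacity=4)
for _ in range(6):                 # stream six clips; only the last four are kept
    bank.write(torch.randn(256))
print(bank.read().shape)           # torch.Size([4, 256])
```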
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [69.83405335645305]
We argue that naively carrying over advances from image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
Replacing it with a scaled spatiotemporal transformer architecture allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.