MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient
Long-Term Video Recognition
- URL: http://arxiv.org/abs/2201.08383v1
- Date: Thu, 20 Jan 2022 18:59:54 GMT
- Title: MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient
Long-Term Video Recognition
- Authors: Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong,
Jitendra Malik, Christoph Feichtenhofer
- Abstract summary: We build a memory-augmented vision transformer that has a temporal support 30x longer than existing models.
MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets.
- Score: 74.35009770905968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While today's video recognition systems parse snapshots or short clips
accurately, they cannot connect the dots and reason across a longer range of
time yet. Most existing video architectures can only process <5 seconds of a
video without hitting the computation or memory bottlenecks.
In this paper, we propose a new strategy to overcome this challenge. Instead
of trying to process more frames at once like most existing methods, we propose
to process videos in an online fashion and cache "memory" at each iteration.
Through the memory, the model can reference prior context for long-term
modeling, with only a marginal cost. Based on this idea, we build MeMViT, a
Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x
longer than existing models with only 4.5% more compute; traditional methods
need >3,000% more compute to do the same. On a wide range of settings, the
increased temporal support enabled by MeMViT brings large gains in recognition
accuracy consistently. MeMViT obtains state-of-the-art results on the AVA,
EPIC-Kitchens-100 action classification, and action anticipation datasets. Code
and models will be made publicly available.
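The core idea in the abstract (process a long video clip by clip, cache a "memory" of each clip, and let later clips attend to that cached context at marginal cost) can be illustrated with a minimal PyTorch sketch. This is a hypothetical, simplified single-head version: the class and parameter names (CachedMemoryAttention, max_cached_clips) are not from the released code, and the actual MeMViT additionally compresses the cached memory before attending over it, which is omitted here.

```python
import torch
import torch.nn as nn

class CachedMemoryAttention(nn.Module):
    """Single-head attention over the current clip's tokens plus keys/values
    cached from previously processed clips (illustrative sketch only)."""

    def __init__(self, dim: int, max_cached_clips: int = 2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.max_cached_clips = max_cached_clips
        self.k_cache, self.v_cache = [], []        # one cached tensor per past clip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) for the current short clip.
        q, k_new, v_new = self.qkv(x).chunk(3, dim=-1)

        # Queries come only from the current clip, but keys/values are
        # extended with the cached "memory" of earlier clips.
        k = torch.cat(self.k_cache + [k_new], dim=1)
        v = torch.cat(self.v_cache + [v_new], dim=1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v

        # Cache this clip's keys/values, detached so no gradients flow back
        # through time; only a bounded number of past clips is kept.
        self.k_cache.append(k_new.detach())
        self.v_cache.append(v_new.detach())
        if len(self.k_cache) > self.max_cached_clips:
            self.k_cache.pop(0)
            self.v_cache.pop(0)
        return self.proj(out)


# Usage: stream a long video as consecutive short clips through one module.
attn = CachedMemoryAttention(dim=96, max_cached_clips=2)
clips = [torch.randn(1, 196, 96) for _ in range(8)]    # 8 consecutive clips
outputs = [attn(clip) for clip in clips]               # each step sees cached context
```

Detaching the cached keys and values is what keeps the extra cost marginal in this sketch: later clips attend over a longer temporal context, but no gradients are backpropagated through past clips.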
Related papers
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges [42.555895949250704]
VideoLLaMB is a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences.
The SceneTilling algorithm segments videos into independent semantic units to preserve semantic integrity.
In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding [66.56100008577134]
This study focuses on designing an efficient and effective model for long-term video understanding.
We propose to process videos in an online manner and store past video information in a memory bank.
Our model can achieve state-of-the-art performances across multiple datasets.
arXiv Detail & Related papers (2024-04-08T17:59:24Z)
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
Our scaled spatiotemporal transformer architecture instead allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Memory Efficient Temporal & Visual Graph Model for Unsupervised Video Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
arXiv Detail & Related papers (2022-08-13T02:56:10Z)
- TALLFormer: Temporal Action Localization with Long-memory Transformer [16.208160001820044]
TALLFormer is a memory-efficient and end-to-end trainable temporal action localization transformer.
Our long-term memory mechanism eliminates the need for processing hundreds of redundant video frames during each training iteration; a rough sketch of this idea appears after this list.
With only RGB frames as input, TALLFormer outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-04-04T17:51:20Z)
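For the TALLFormer entry above, the claim that a long-term memory removes the need to re-encode hundreds of frames per training iteration can be illustrated with a rough, hypothetical sketch: cache per-frame features across iterations and run the heavy encoder only on a short window each step. LongTermFeatureMemory and its parameters are invented names for illustration; this is not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LongTermFeatureMemory:
    """Toy long-term frame-feature memory: each iteration re-encodes only a
    short window of frames and reuses cached features for the rest of the
    (long) video. Illustrative only, not the TALLFormer implementation."""

    def __init__(self, encoder: nn.Module, num_frames: int, dim: int):
        self.encoder = encoder
        self.memory = torch.zeros(num_frames, dim)    # cached per-frame features

    def step(self, frames: torch.Tensor, window: slice) -> torch.Tensor:
        fresh = self.encoder(frames[window])          # only these frames hit the encoder
        self.memory[window] = fresh.detach()          # refresh the cache without gradients
        full = self.memory.clone()                    # cached features for all other frames
        full[window] = fresh                          # the window keeps its gradient path
        return full                                   # (num_frames, dim) for a downstream head


# Usage with a stand-in per-frame encoder; a real model would use a video backbone.
encoder = nn.Linear(512, 256)                         # toy frame encoder
memory = LongTermFeatureMemory(encoder, num_frames=400, dim=256)
video = torch.randn(400, 512)                         # 400 precomputed frame inputs
for start in range(0, 400, 32):                       # one short window per iteration
    features = memory.step(video, slice(start, start + 32))
```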