Anticipative Video Transformer
- URL: http://arxiv.org/abs/2106.02036v1
- Date: Thu, 3 Jun 2021 17:57:55 GMT
- Title: Anticipative Video Transformer
- Authors: Rohit Girdhar and Kristen Grauman
- Abstract summary: Anticipative Video Transformer (AVT) is an end-to-end attention-based video modeling architecture.
We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features.
- Score: 105.20878510342551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Anticipative Video Transformer (AVT), an end-to-end
attention-based video modeling architecture that attends to the previously
observed video in order to anticipate future actions. We train the model
jointly to predict the next action in a video sequence, while also learning
frame feature encoders that are predictive of successive future frames'
features. Compared to existing temporal aggregation strategies, AVT has the
advantage of maintaining the sequential progression of observed actions
while still capturing long-range dependencies, both critical for the
anticipation task. Through extensive experiments, we show that AVT obtains the
best reported performance on four popular action anticipation benchmarks:
EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads, including
outperforming all submissions to the EpicKitchens-100 CVPR'21 challenge.
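The abstract describes two jointly trained objectives: classifying the next action and predicting the features of future frames. A minimal sketch of how such a joint loss can be wired up is shown below, using a causally masked transformer over per-frame features; the module names, dimensions, and the simple MSE feature loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnticipativeTransformerSketch(nn.Module):
    """Causal transformer over per-frame features with two heads:
    next-action classification and next-frame feature regression."""

    def __init__(self, feat_dim=512, num_actions=100, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(feat_dim, num_actions)  # next-action logits
        self.future_head = nn.Linear(feat_dim, feat_dim)     # next-frame feature

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim) from any per-frame backbone
        T = frame_feats.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=frame_feats.device), diagonal=1)
        ctx = self.temporal(frame_feats, mask=causal)  # attends to the past only
        return self.action_head(ctx), self.future_head(ctx)

def joint_loss(model, frame_feats, next_action):
    # next_action: (batch,) index of the action that follows the observed clip
    logits, pred = model(frame_feats)
    cls_loss = F.cross_entropy(logits[:, -1], next_action)
    # every position t is also trained to predict the features of frame t+1
    feat_loss = F.mse_loss(pred[:, :-1], frame_feats[:, 1:].detach())
    return cls_loss + feat_loss
```

The causal mask is what maintains the sequential progression mentioned above: position t may attend to any earlier frame, capturing long-range dependencies, but never to the future it is asked to predict.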
Related papers
- VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning [59.68917139718813]
We show that a strong off-the-shelf frozen pretrained visual encoder can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning.
By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model learns robust representations for forecasting.
arXiv Detail & Related papers (2024-10-04T14:52:09Z)
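As a rough illustration of the conditioning pattern VEDIT's summary describes, the sketch below feeds frozen clip-level embeddings of the observed steps into a small trainable predictor that outputs action logits for the next unseen step. The encoder interface, dimensions, and head are assumed for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class FrozenConditionedForecaster(nn.Module):
    """Predict the next unseen step's action from frozen clip embeddings."""

    def __init__(self, frozen_encoder, embed_dim=768, num_actions=100):
        super().__init__()
        self.encoder = frozen_encoder.eval()
        for p in self.encoder.parameters():   # visual encoder stays frozen
            p.requires_grad = False
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, num_layers=2)  # trained
        self.action_head = nn.Linear(embed_dim, num_actions)

    def forward(self, observed_clips):
        with torch.no_grad():                 # frozen clip-level embeddings
            z = self.encoder(observed_clips)  # (batch, steps, embed_dim)
        ctx = self.predictor(z)               # only the predictor learns
        return self.action_head(ctx[:, -1])   # logits for the next unseen step
```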
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion to this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates the clip features in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
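The decompose-encode-aggregate pipeline in this summary maps naturally onto a recurrent readout. The sketch below encodes clips independently and fuses them one at a time, so a class prediction is available after every observed clip; all names are hypothetical, and the paper's per-class action prototypes are omitted.

```python
import torch
import torch.nn as nn

class OnlineClipAggregator(nn.Module):
    """Encode short clips independently, then fuse them online for early prediction."""

    def __init__(self, clip_encoder, feat_dim=512, num_classes=100):
        super().__init__()
        self.clip_encoder = clip_encoder               # any per-clip visual encoder
        self.decoder = nn.GRUCell(feat_dim, feat_dim)  # online, clip-by-clip fusion
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):
        # clips: (batch, num_clips, ...) - a partially observed video, pre-split
        state = clips.new_zeros(clips.size(0), self.decoder.hidden_size)
        predictions = []
        for i in range(clips.size(1)):
            feat = self.clip_encoder(clips[:, i])      # independent clip features
            state = self.decoder(feat, state)          # aggregate what was seen so far
            predictions.append(self.classifier(state)) # prediction after each clip
        # one class prediction per observed prefix of the video
        return torch.stack(predictions, dim=1)
```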
- Multiscale Video Pretraining for Long-Term Activity Forecasting [67.06864386274736]
Multiscale Video Pretraining (MVP) learns robust representations for forecasting by predicting contextualized representations of future video clips over multiple timescales.
MVP is based on our observation that actions in videos have a multiscale nature, where atomic actions typically occur at a short timescale and more complex actions may span longer timescales.
Our comprehensive experiments across the Ego4D and Epic-Kitchens-55/100 datasets demonstrate that MVP outperforms state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2023-07-24T14:55:15Z)
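One minimal way to realize "predicting contextualized representations of future clips over multiple timescales" is to pool future clip features over windows of increasing length and fit one prediction head per scale, as sketched below. The precomputed clip features, MSE objective, and per-scale linear heads are illustrative stand-ins, not MVP's actual formulation.

```python
import torch.nn.functional as F

def multiscale_targets(clip_feats, scales=(1, 2, 4)):
    """For each scale s, the target at time t is the mean feature of clips
    t+1 .. t+s: a coarser summary for longer horizons.
    clip_feats: (batch, time, dim)."""
    targets = {}
    for s in scales:
        windows = clip_feats.unfold(1, s, 1).mean(dim=-1)  # (batch, time-s+1, dim)
        targets[s] = windows[:, 1:]                        # windows starting at t+1
    return targets

def multiscale_loss(context, clip_feats, heads, scales=(1, 2, 4)):
    # context: (batch, time, dim) contextualized features from any video model
    # heads: one predictor per scale, e.g. {s: torch.nn.Linear(512, 512) for s in scales}
    targets = multiscale_targets(clip_feats, scales)
    loss = 0.0
    for s in scales:
        tgt = targets[s]
        pred = heads[s](context)[:, :tgt.size(1)]  # align predictions with targets
        loss = loss + F.mse_loss(pred, tgt.detach())
    return loss / len(scales)
```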
- Rethinking Learning Approaches for Long-Term Action Anticipation [32.67768331823358]
Action anticipation involves predicting future actions after observing only the initial portion of a video.
We introduce ANTICIPATR, which performs long-term action anticipation.
We propose a two-stage learning approach to train a novel transformer-based model.
arXiv Detail & Related papers (2022-10-20T20:07:30Z)
- NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022 [13.603712913129506]
We describe the technical details of our submission to the EPIC-KITCHENS-100 action anticipation challenge.
Our models, a higher-order recurrent space-time transformer and a message-passing neural network with edge learning, are both recurrent architectures that observe only 2.5 seconds of inference context to form their action anticipation predictions.
By averaging the prediction scores of a set of models trained with our proposed pipeline, we achieved 19.61% overall mean top-5 recall on the test set, second place on the public leaderboard.
arXiv Detail & Related papers (2022-06-22T06:34:58Z)
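The final averaging step is simple enough to show directly. Below is a sketch of uniform score averaging over an ensemble, returning the averaged class probabilities together with the top-5 classes that EPIC-KITCHENS' mean top-5 recall metric evaluates; the function name and interfaces are assumed.

```python
import torch

def ensemble_anticipation_scores(models, clip):
    """Average class scores from a set of trained models (uniform weights).

    models: iterable of callables mapping a clip tensor to (num_classes,) logits.
    """
    with torch.no_grad():
        probs = torch.stack([m(clip).softmax(dim=-1) for m in models])
    avg = probs.mean(dim=0)          # ensemble score per class
    return avg, avg.topk(5).indices  # top-5 candidates for mean top-5 recall
```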
- The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction [104.628661890361]
Early action prediction deals with inferring the ongoing action from partially observed videos, typically given only their initial portion.
We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales.
arXiv Detail & Related papers (2022-04-28T08:21:09Z)
- Clockwork Variational Autoencoders [33.17951971728784]
We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences.
We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets.
We propose a Minecraft benchmark for long-term video prediction.
arXiv Detail & Related papers (2021-02-18T18:23:04Z)
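CW-VAE's defining mechanism is its update schedule: level k of the latent hierarchy ticks only every stride**k steps, so higher levels summarize longer spans of the video. The sketch below shows just that schedule with deterministic GRU states; the actual model is variational, with learned priors and posteriors per level, which this omits.

```python
import torch
import torch.nn as nn

class ClockworkHierarchySketch(nn.Module):
    """Stack of recurrent states where level k updates every stride**k steps."""

    def __init__(self, num_levels=3, state_dim=128, stride=4):
        super().__init__()
        self.stride = stride
        # each level sees the current input and the state of the level above
        self.cells = nn.ModuleList(
            nn.GRUCell(2 * state_dim, state_dim) for _ in range(num_levels)
        )

    def forward(self, inputs):
        # inputs: (batch, time, state_dim), e.g. encoded video frames
        batch, steps, dim = inputs.shape
        states = [inputs.new_zeros(batch, dim) for _ in self.cells]
        for t in range(steps):
            # update slow (higher) levels first so lower levels see fresh context
            for k in reversed(range(len(self.cells))):
                if t % self.stride ** k == 0:  # level k ticks every stride**k steps
                    above = (states[k + 1] if k + 1 < len(states)
                             else torch.zeros_like(states[k]))
                    inp = torch.cat([inputs[:, t], above], dim=-1)
                    states[k] = self.cells[k](inp, states[k])
        return states  # one latent state per level; slow levels carry long context
```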
- Transformation-based Adversarial Video Prediction on Large-Scale Data [19.281817081571408]
We focus on the task of video prediction, where given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence.
We first improve the state of the art by performing a systematic empirical study of discriminator decompositions.
We then propose a novel recurrent unit which transforms its past hidden state according to predicted motion-like features.
arXiv Detail & Related papers (2020-03-09T10:52:25Z)
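The "novel recurrent unit" above is described as transforming its past hidden state according to predicted motion-like features rather than recomputing it from the input alone. A dense scale-and-shift stand-in for that pattern is sketched below; the unit in the paper operates on convolutional feature maps, so treat this purely as an illustration.

```python
import torch
import torch.nn as nn

class TransformingRecurrentCell(nn.Module):
    """Recurrent cell that warps its past hidden state with predicted
    motion-like features instead of recomputing it from scratch."""

    def __init__(self, input_dim=256, hidden_dim=256):
        super().__init__()
        # predict per-unit scale and shift ("motion") from input + past state
        self.motion = nn.Linear(input_dim + hidden_dim, 2 * hidden_dim)
        self.gate = nn.Linear(input_dim + hidden_dim, hidden_dim)

    def forward(self, x, h):
        joint = torch.cat([x, h], dim=-1)
        scale, shift = self.motion(joint).chunk(2, dim=-1)
        transformed = h * torch.sigmoid(scale) + torch.tanh(shift)  # warp past state
        g = torch.sigmoid(self.gate(joint))  # how much of the transform to keep
        return g * transformed + (1 - g) * h
```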
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and accepts no responsibility for any consequences of its use.