Rethinking Learning Approaches for Long-Term Action Anticipation
- URL: http://arxiv.org/abs/2210.11566v1
- Date: Thu, 20 Oct 2022 20:07:30 GMT
- Title: Rethinking Learning Approaches for Long-Term Action Anticipation
- Authors: Megha Nawhal, Akash Abdu Jyothi, Greg Mori
- Abstract summary: Action anticipation involves predicting future actions after observing the initial portion of a video.
We introduce ANTICIPATR, which performs long-term action anticipation.
We propose a two-stage learning approach to train a novel transformer-based model.
- Score: 32.67768331823358
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Action anticipation involves predicting future actions after
observing the initial portion of a video. Typically, the observed video is
processed as a whole to obtain a video-level representation of the ongoing
activity, which is then used for future prediction. We introduce ANTICIPATR,
which performs long-term action anticipation by leveraging segment-level
representations learned from individual segments of different activities, in
addition to a video-level representation. We propose a two-stage learning
approach to train a novel transformer-based model that uses these two types of
representations to directly predict a set of future action instances over any
given anticipation duration. Results on the Breakfast, 50Salads,
EPIC-Kitchens-55, and EGTEA Gaze+ datasets demonstrate the effectiveness of
our approach.
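
The set-prediction formulation described in the abstract maps naturally onto a query-based transformer decoder. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; all module names, sizes, and heads are illustrative assumptions. Each learned query attends to the segment-level and video-level encodings and emits one candidate future action instance (a class plus a start/end within the anticipation duration).

```python
import torch
import torch.nn as nn

class AnticipationDecoder(nn.Module):
    """Hypothetical query-based decoder; not the released ANTICIPATR code."""
    def __init__(self, dim=256, num_queries=20, num_classes=48):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)    # one query per predicted instance
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for a "no action" slot
        self.span_head = nn.Linear(dim, 2)               # normalized (start, end)

    def forward(self, segment_feats, video_feat, horizon):
        # segment_feats: (B, S, dim) segment-level encodings of the observed video
        # video_feat:    (B, 1, dim) video-level encoding of the ongoing activity
        memory = torch.cat([segment_feats, video_feat], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(segment_feats.size(0), -1, -1)
        h = self.decoder(q, memory)
        spans = self.span_head(h).sigmoid() * horizon    # scale to anticipation duration
        return self.cls_head(h), spans

model = AnticipationDecoder()
logits, spans = model(torch.randn(2, 12, 256), torch.randn(2, 1, 256), horizon=60.0)
print(logits.shape, spans.shape)  # torch.Size([2, 20, 49]) torch.Size([2, 20, 2])
```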
Related papers
- Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization [20.268572246761895]
We propose to locate the temporal boundaries of each action and predict its action class in untrimmed videos.
Faster-TAD simplifies the TAD pipeline and achieves remarkable performance.
arXiv Detail & Related papers (2024-10-31T14:16:56Z)
- VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning [59.68917139718813]
We show that a strong off-the-shelf frozen pretrained visual encoder can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning.
Conditioned on frozen clip-level embeddings from observed steps, our prediction model learns robust representations for forecasting the actions of unseen steps.
arXiv Detail & Related papers (2024-10-04T14:52:09Z)
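
A hedged sketch of the conditioning scheme summarized above: a frozen, pretrained clip encoder embeds the observed steps, and only a lightweight predictor is trained on top of those embeddings. The encoder interface and all sizes are assumptions, not VEDIT's actual code.

```python
import torch
import torch.nn as nn

class StepPredictor(nn.Module):
    def __init__(self, clip_encoder, dim=512, num_actions=100):
        super().__init__()
        self.encoder = clip_encoder.eval()          # frozen; never updated
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(dim, num_actions)

    def forward(self, observed_clips):
        # observed_clips: inputs for the observed steps
        with torch.no_grad():                       # clip embeddings stay frozen
            z = self.encoder(observed_clips)        # (B, T, dim)
        h = self.predictor(z)
        return self.action_head(h[:, -1])           # action logits for the next step

# Usage with a stand-in encoder that already returns (B, T, dim) features:
model = StepPredictor(nn.Identity())
print(model(torch.randn(2, 8, 512)).shape)  # torch.Size([2, 100])
```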
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses state-of-the-art methods on the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
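
The pipeline described above can be illustrated with a small sketch: recognized past actions and a vision-language scene description are composed into a prompt, which a language model completes with future actions. All component functions below are hypothetical placeholders, not PALM's interfaces.

```python
def anticipate(frames, recognizer, captioner, language_model, horizon=5):
    past_actions = recognizer(frames)          # e.g. ["crack egg", "whisk egg"]
    scene = captioner(frames[-1])              # e.g. "a pan heats on the stove"
    prompt = (
        f"Observed actions so far: {', '.join(past_actions)}. "
        f"Scene: {scene}. "
        f"List the next {horizon} likely actions:"
    )
    return language_model(prompt)              # free-form future action list

# Toy stand-ins show the data flow without any pretrained models:
out = anticipate(
    frames=["f1", "f2"],
    recognizer=lambda f: ["crack egg", "whisk egg"],
    captioner=lambda f: "a pan heats on the stove",
    language_model=lambda p: ["pour egg", "stir", "season", "plate", "serve"],
)
print(out)
```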
- Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction [15.696593695918844]
This paper introduces a novel self-supervised video strategy for enhancing action prediction, inspired by DINO (self-distillation with no labels).
The experimental results showcase significant improvements in prediction performance across 3D-ResNet, Transformer, and LSTM architectures.
These findings highlight the potential of our approach in diverse video-based tasks such as activity recognition, motion planning, and scene understanding.
arXiv Detail & Related papers (2023-08-08T21:18:23Z)
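
A minimal sketch of DINO-style self-distillation adapted to prediction, as the summary suggests: a student network sees only past frames, an exponential-moving-average teacher sees a longer window, and the student is trained to match the teacher's output distribution without labels. Architectures and hyperparameters here are assumptions.

```python
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Sequential(torch.nn.Flatten(1), torch.nn.Linear(16 * 64, 256))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

def distill_step(past, full, opt, tau_s=0.1, tau_t=0.04, momentum=0.996):
    with torch.no_grad():
        target = F.softmax(teacher(full) / tau_t, dim=-1)    # no labels used
    log_pred = F.log_softmax(student(past) / tau_s, dim=-1)
    loss = -(target * log_pred).sum(dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                    # EMA teacher update
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)
    return loss.item()

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
past_view, full_view = torch.randn(4, 16, 64), torch.randn(4, 16, 64)
print(distill_step(past_view, full_view, opt))
```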
- Future Transformer for Long-term Action Anticipation [33.771374384674836]
We propose an end-to-end attention model for action anticipation, dubbed Future Transformer (FUTR).
Unlike previous autoregressive models, the proposed method learns to predict the whole sequence of future actions via parallel decoding.
We evaluate our method on two standard benchmarks for long-term action anticipation, Breakfast and 50Salads, achieving state-of-the-art results.
arXiv Detail & Related papers (2022-05-27T14:47:43Z)
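
The contrast with autoregressive decoding can be made concrete: instead of generating one future action per forward pass, a fixed set of future-position queries decodes every step at once. The sketch below illustrates this parallel decoding pattern under assumed dimensions; it is not the released FUTR model.

```python
import torch
import torch.nn as nn

class ParallelAnticipator(nn.Module):
    def __init__(self, dim=256, future_steps=8, num_classes=19):
        super().__init__()
        self.future_queries = nn.Embedding(future_steps, dim)  # one per future step
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, observed):                # observed: (B, T_obs, dim)
        q = self.future_queries.weight.unsqueeze(0).expand(observed.size(0), -1, -1)
        # single forward pass; an autoregressive model would loop future_steps times
        return self.head(self.decoder(q, observed))

model = ParallelAnticipator()
print(model(torch.randn(2, 30, 256)).shape)  # (2, 8, 19): all steps decoded at once
```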
- The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction [104.628661890361]
Early action prediction deals with inferring the ongoing action from a partially observed video, typically its initial portion.
We propose a bottleneck-based attention model that captures the evolution of the action through progressive sampling over fine-to-coarse scales.
arXiv Detail & Related papers (2022-04-28T08:21:09Z)
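
One way to read the bottleneck-plus-scales design: the observed portion is sampled at several temporal granularities, each scale is summarized through a small set of latent tokens via cross-attention, and per-scale predictions are aggregated. The sketch below is an interpretation under assumed shapes, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ScaleTower(nn.Module):
    def __init__(self, dim=256, num_latents=8, num_classes=10):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames):                       # frames: (B, T, dim)
        q = self.latents.unsqueeze(0).expand(frames.size(0), -1, -1)
        summary, _ = self.attn(q, frames, frames)    # bottleneck: T -> num_latents
        return self.head(summary.mean(dim=1))

towers = nn.ModuleList(ScaleTower() for _ in range(3))
video = torch.randn(2, 64, 256)
scales = [video[:, ::8], video[:, ::4], video[:, ::2]]   # coarse to fine sampling
logits = torch.stack([t(s) for t, s in zip(towers, scales)]).mean(0)
print(logits.shape)  # torch.Size([2, 10])
```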
- Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both the final performance and the sample efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z)
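
A minimal sketch of action-free generative pre-training as summarized above: a recurrent latent model is trained on videos alone to predict the next frame's representation, and its weights would later initialize a vision-based RL agent. The tiny linear encoder/decoder and squared-error objective are simplifying assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(64, 32)            # stand-in for a convolutional frame encoder
dynamics = nn.GRU(32, 32, batch_first=True)
decoder = nn.Linear(32, 64)            # predicts the next frame from the latent

def pretrain_step(frames, opt):        # frames: (B, T, 64); no actions involved
    z = encoder(frames[:, :-1])        # encode all but the last frame
    h, _ = dynamics(z)                 # roll the latent state forward in time
    pred = decoder(h)                  # predict each following frame
    loss = ((pred - frames[:, 1:]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

params = [*encoder.parameters(), *dynamics.parameters(), *decoder.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)
print(pretrain_step(torch.randn(4, 10, 64), opt))
```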
- Anticipative Video Transformer [105.20878510342551]
Anticipative Video Transformer (AVT) is an end-to-end attention-based video modeling architecture.
We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features.
arXiv Detail & Related papers (2021-06-03T17:57:55Z)
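
The joint objective reads as two heads on one causal backbone: classify the next action and regress each successive frame's features. The sketch below mirrors that stated idea with assumed losses and dimensions; it is not the released AVT code.

```python
import torch
import torch.nn as nn

class JointAnticipator(nn.Module):
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(dim, num_classes)
        self.feature_head = nn.Linear(dim, dim)

    def forward(self, feats):                          # feats: (B, T, dim)
        T = feats.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(feats, mask=causal)          # position t sees frames <= t
        return self.action_head(h[:, -1]), self.feature_head(h[:, :-1])

model = JointAnticipator()
feats, next_action = torch.randn(2, 16, 256), torch.randint(0, 20, (2,))
logits, pred_feats = model(feats)
loss = nn.functional.cross_entropy(logits, next_action) \
     + nn.functional.mse_loss(pred_feats, feats[:, 1:])  # predict next frame features
print(loss.item())
```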
- Long-Term Anticipation of Activities with Cycle Consistency [90.79357258104417]
We propose a framework for anticipating future activities directly from the features of the observed frames and train it in an end-to-end fashion.
Our framework achieves state-of-the-art results on two datasets, Breakfast and 50Salads.
arXiv Detail & Related papers (2020-09-02T15:41:32Z)
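
The cycle-consistency idea in the entry above can be sketched as a forward model that anticipates future features from observed ones, paired with a backward model that reconstructs the past from the predicted future; disagreement in either direction is penalized. Both models and the equal loss weighting below are illustrative assumptions.

```python
import torch
import torch.nn as nn

forward_model = nn.GRU(128, 128, batch_first=True)   # observed -> future features
backward_model = nn.GRU(128, 128, batch_first=True)  # predicted future -> past

def cycle_loss(observed, future):                    # both (B, T, 128) for simplicity
    pred_future, _ = forward_model(observed)
    anticipation = ((pred_future - future) ** 2).mean()   # match the true future
    recon_past, _ = backward_model(pred_future)
    cycle = ((recon_past - observed) ** 2).mean()         # cycle back to the past
    return anticipation + cycle

print(cycle_loss(torch.randn(2, 12, 128), torch.randn(2, 12, 128)).item())
```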