Predicting the Next Action by Modeling the Abstract Goal
- URL: http://arxiv.org/abs/2209.05044v5
- Date: Wed, 21 Aug 2024 02:51:36 GMT
- Title: Predicting the Next Action by Modeling the Abstract Goal
- Authors: Debaditya Roy, Basura Fernando
- Abstract summary: We present an action anticipation model that leverages goal information for the purpose of reducing the uncertainty in future predictions.
We derive a novel concept called the abstract goal, which is conditioned on observed sequences of visual features for action anticipation.
Our method obtains impressive results on the very challenging Epic-Kitchens55 (EK55), EK100, and EGTEA Gaze+ datasets.
- Score: 18.873728614415946
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The problem of anticipating human actions is inherently uncertain. However, we can reduce this uncertainty if we have a sense of the goal that the actor is trying to achieve. Here, we present an action anticipation model that leverages goal information to reduce the uncertainty in future predictions. Since neither goal information nor the observed actions are available during inference, we rely on visual representations to encapsulate information about both actions and goals. From these, we derive a novel concept called the abstract goal, which is conditioned on observed sequences of visual features for action anticipation. We design the abstract goal as a distribution whose parameters are estimated using a variational recurrent network. We sample multiple candidates for the next action and introduce a goal consistency measure to determine the best candidate that follows from the abstract goal. Our method obtains impressive results on the very challenging Epic-Kitchens55 (EK55), EK100, and EGTEA Gaze+ datasets. We obtain absolute improvements of +13.69, +11.24, and +5.19 for Top-1 verb, Top-1 noun, and Top-1 action anticipation accuracy, respectively, over prior state-of-the-art methods on the seen kitchens (S1) split of EK55. Similarly, we obtain significant improvements on the unseen kitchens (S2) split for Top-1 verb (+10.75), noun (+5.84), and action (+2.87) anticipation. A similar trend is observed on the EGTEA Gaze+ dataset, where absolute improvements of +9.9, +13.1, and +6.8 are obtained for noun, verb, and action anticipation. As of this submission, our method is the new state-of-the-art for action anticipation on EK55 and EGTEA Gaze+ (leaderboard: https://competitions.codalab.org/competitions/20071#results). Code is available at https://github.com/debadityaroy/Abstract_Goal
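To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of the idea, not the authors' released implementation: it assumes a Gaussian parameterisation of the abstract goal, uses a plain GRU in place of the paper's variational recurrent network, and uses cosine similarity as a stand-in for the goal consistency measure. All class names, layer sizes, and the action-vocabulary size are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AbstractGoalAnticipatorSketch(nn.Module):
    """Toy model: encode observed features, sample an 'abstract goal' from a
    Gaussian, propose next-action candidates, and keep the most goal-consistent one."""

    def __init__(self, feat_dim=2048, hidden_dim=512, goal_dim=128, num_actions=1000):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)  # observed-feature encoder
        self.goal_mu = nn.Linear(hidden_dim, goal_dim)              # mean of the abstract goal
        self.goal_logvar = nn.Linear(hidden_dim, goal_dim)          # log-variance of the abstract goal
        self.action_head = nn.Linear(hidden_dim + goal_dim, num_actions)
        # Maps (sampled goal, candidate action distribution) to the goal implied by that candidate.
        self.implied_goal = nn.Linear(goal_dim + num_actions, goal_dim)

    def forward(self, feats, num_samples=10):
        # feats: (batch, time, feat_dim) visual features of the observed video segment.
        _, h = self.rnn(feats)
        h = h.squeeze(0)                                            # (batch, hidden_dim)
        mu, logvar = self.goal_mu(h), self.goal_logvar(h)
        std = torch.exp(0.5 * logvar)

        all_logits, all_scores = [], []
        for _ in range(num_samples):
            goal = mu + std * torch.randn_like(std)                 # reparameterised goal sample
            logits = self.action_head(torch.cat([h, goal], dim=-1))
            probs = F.softmax(logits, dim=-1)
            implied = self.implied_goal(torch.cat([goal, probs], dim=-1))
            # Goal consistency score: cosine similarity is only a stand-in here.
            all_scores.append(F.cosine_similarity(implied, goal, dim=-1))
            all_logits.append(logits)

        scores = torch.stack(all_scores)                            # (num_samples, batch)
        logits = torch.stack(all_logits)                            # (num_samples, batch, num_actions)
        best = scores.argmax(dim=0)                                 # most consistent candidate per example
        return logits[best, torch.arange(feats.size(0))]            # (batch, num_actions)


# Example usage with random features standing in for a video backbone's output.
model = AbstractGoalAnticipatorSketch()
clip_feats = torch.randn(4, 8, 2048)       # 4 videos, 8 observed snippets each
next_action_logits = model(clip_feats)     # (4, num_actions)
```

The sketch keeps only the best-scoring candidate at inference time; the actual training objective (variational and classification losses) is omitted.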
Related papers
- VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning [59.68917139718813]
We show that a strong off-the-shelf frozen pretrained visual encoder can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning.
By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model is able to learn robust representations for forecasting.
arXiv Detail & Related papers (2024-10-04T14:52:09Z) - Semantically Guided Representation Learning For Action Anticipation [9.836788915947924]
We propose the novel Semantically Guided Representation Learning (S-GEAR) framework.
S-GEAR learns visual action prototypes and leverages language models to structure their relationship, inducing semanticity.
We observe that S-GEAR effectively transfers the geometric associations between actions from language to visual prototypes.
arXiv Detail & Related papers (2024-07-02T14:44:01Z) - Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates features from all the clips in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z) - DiffAnt: Diffusion Models for Action Anticipation [12.022815981853071]
Anticipating future actions is inherently uncertain. Given an observed video segment containing ongoing actions, multiple subsequent actions can plausibly follow.
In this work, we rethink action anticipation from a generative view, employing diffusion models to capture different possible future actions.
Our code and trained models will be published on GitHub.
arXiv Detail & Related papers (2023-11-27T16:40:09Z) - AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? [28.912026171231528]
Long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences.
We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal.
We propose a two-stage framework, AntGPT, which first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation (a toy sketch of this two-stage idea appears at the end of this list).
arXiv Detail & Related papers (2023-07-31T02:14:19Z) - Action Anticipation with Goal Consistency [19.170733994203367]
We propose to harness high-level intent information to anticipate actions that will take place in the future.
We show the effectiveness of the proposed approach and demonstrate that our method achieves state-of-the-art results on two large-scale datasets.
arXiv Detail & Related papers (2023-06-26T20:04:23Z) - NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022 [13.603712913129506]
We describe the technical details of our submission for the EPIC-Kitchen-100 action anticipation challenge.
Our models, a higher-order recurrent space-time transformer and a message-passing neural network with edge learning, are both recurrent architectures that observe only 2.5 seconds of inference context to form the action anticipation prediction.
By averaging the prediction scores from a set of models compiled with our proposed training pipeline, we achieved strong performance on the test set: 19.61% overall mean top-5 recall, second place on the public leaderboard.
arXiv Detail & Related papers (2022-06-22T06:34:58Z) - The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction [104.628661890361]
Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video.
We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales.
arXiv Detail & Related papers (2022-04-28T08:21:09Z) - Anticipative Video Transformer [105.20878510342551]
Anticipative Video Transformer (AVT) is an end-to-end attention-based video modeling architecture.
We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features.
arXiv Detail & Related papers (2021-06-03T17:57:55Z) - Panoptic Segmentation Forecasting [71.75275164959953]
Our goal is to forecast the near future given a set of recent observations.
We think this ability to forecast, i.e., to anticipate, is integral to the success of autonomous agents.
We develop a two-component model: one component learns the dynamics of the background stuff by anticipating odometry, while the other anticipates the dynamics of detected things.
arXiv Detail & Related papers (2021-04-08T17:59:16Z) - Learning to Anticipate Egocentric Actions by Imagination [60.21323541219304]
We study the egocentric action anticipation task, which predicts a future action seconds before it is performed in egocentric videos.
Our method significantly outperforms previous methods on both the seen test set and the unseen test set of the EPIC Kitchens Action Anticipation Challenge.
arXiv Detail & Related papers (2021-01-13T08:04:10Z)
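As referenced in the AntGPT entry above, here is a self-contained toy sketch of the two-stage recognize-then-prompt idea. The functions `recognize_actions` and `query_llm` are hypothetical stubs standing in for a video action recognizer and an instruction-tuned language model; nothing here is taken from the AntGPT codebase.

```python
from typing import List


def recognize_actions(clip_features) -> List[str]:
    """Stage 1 (stub): map observed clips to verb-noun labels, e.g. via a video classifier."""
    return ["take knife", "cut onion"]          # canned output for illustration


def build_prompt(observed: List[str], horizon: int = 3) -> str:
    """Stage 2, step 1: turn the recognized action history into a text prompt."""
    history = ", ".join(observed)
    return (f"A person in a kitchen has performed these actions in order: {history}. "
            f"List the next {horizon} likely actions as verb-noun pairs.")


def query_llm(prompt: str) -> str:
    """Stage 2, step 2 (stub): call any instruction-tuned LLM here."""
    return "put onion in pan, turn on hob, stir onion"


if __name__ == "__main__":
    observed = recognize_actions(clip_features=None)
    prediction = query_llm(build_prompt(observed))
    print(prediction)   # anticipated future verb-noun sequence
```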
This list is automatically generated from the titles and abstracts of the papers on this site.