AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
- URL: http://arxiv.org/abs/2307.16368v3
- Date: Mon, 1 Apr 2024 01:33:53 GMT
- Title: AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
- Authors: Qi Zhao, Shijie Wang, Ce Zhang, Changcheng Fu, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun
- Abstract summary: Long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences.
We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal.
We propose a two-stage framework, AntGPT, which first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation.
- Score: 28.912026171231528
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after the current action (e.g. crack eggs)? What if we also know the actor's longer-term goal (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics, and a top-down approach that infers the actor's goal and plans the procedure needed to accomplish it. We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), can help LTA from both perspectives: they can provide prior knowledge about likely next actions, and they can infer the goal from the observed part of a procedure. To leverage the LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure by chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+ demonstrate the effectiveness of our approach. AntGPT achieves state-of-the-art performance on all of the above benchmarks and, as qualitative analysis shows, can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction. Code and model will be released at https://brown-palm.github.io/AntGPT
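The abstract describes a two-stage pipeline: recognize what has already happened in the video, then query an LLM either bottom-up (conditioned generation of the next actions) or top-down (infer the goal, then plan toward it). The sketch below is a minimal, hedged illustration of that flow; the prompt wording, the (verb, noun) input format, and the `llm_complete` helper are illustrative placeholders, not the released AntGPT implementation.

```python
# Minimal sketch of the two-stage idea described above (not the released AntGPT code).
# Assumptions: `recognized_actions` comes from some off-the-shelf action recognition
# model, and `llm_complete` is a placeholder for any text-completion LLM API.
from typing import Callable, List, Tuple

def anticipate_actions(
    recognized_actions: List[Tuple[str, str]],   # (verb, noun) pairs from stage 1
    llm_complete: Callable[[str], str],          # stage-2 LLM, e.g. a wrapped API call
    num_future_actions: int = 20,
    use_goal_inference: bool = True,
) -> str:
    """Build a prompt from recognized actions and ask an LLM for future actions."""
    observed = ", ".join(f"{v} {n}" for v, n in recognized_actions)

    if use_goal_inference:
        # Top-down variant: chain-of-thought style prompt that first infers the goal,
        # then plans the remaining procedure toward it.
        prompt = (
            f"Observed actions so far: {observed}.\n"
            "First, infer the actor's overall goal. Then, step by step, list the next "
            f"{num_future_actions} actions as 'verb noun' pairs needed to reach that goal."
        )
    else:
        # Bottom-up variant: conditioned generation of the next actions directly.
        prompt = (
            f"Observed actions so far: {observed}.\n"
            f"Predict the next {num_future_actions} actions as 'verb noun' pairs."
        )
    return llm_complete(prompt)


# Example usage with a dummy LLM stand-in:
if __name__ == "__main__":
    dummy_llm = lambda p: "Goal: make egg fried rice. Next: mix egg, add rice, ..."
    print(anticipate_actions([("crack", "egg"), ("pour", "oil")], dummy_llm))
```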
Related papers
- ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions [66.20773952864802]
We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images.
We propose ActionCOMET, a framework to discern knowledge present in language models specific to the provided visual input.
arXiv Detail & Related papers (2024-10-17T15:22:57Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
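A hedged sketch of the kind of prompt assembly the PALM entry above describes, where recognized action labels and vision-language frame descriptions are merged into a single LLM query; the function name, input format, and prompt text are illustrative assumptions rather than PALM's actual implementation.

```python
# Illustrative sketch only: combine action labels from a recognition model with
# frame descriptions from a vision-language model into one anticipation prompt.
from typing import List

def build_palm_style_prompt(
    action_labels: List[str],        # e.g. from an action recognition model
    frame_descriptions: List[str],   # e.g. from a captioning / vision-language model
    num_future_actions: int = 20,
) -> str:
    context_lines = [
        f"[t={i}] scene: {desc} | action: {act}"
        for i, (desc, act) in enumerate(zip(frame_descriptions, action_labels))
    ]
    return (
        "Video context (oldest to newest):\n"
        + "\n".join(context_lines)
        + f"\nPredict the next {num_future_actions} actions, one per line."
    )

# Example usage with made-up labels and captions:
print(build_palm_style_prompt(["crack egg", "pour oil"],
                              ["hands over a bowl", "pan on the stove"]))
```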
- Palm: Predicting Actions through Language Models @ Ego4D Long-Term Action Anticipation Challenge 2023 [100.32802766127776]
Palm is a solution to the Long-Term Action Anticipation task utilizing vision-language and large language models.
It predicts future actions based on frame descriptions and action labels extracted from the input videos.
arXiv Detail & Related papers (2023-06-28T20:33:52Z)
- Rethinking Learning Approaches for Long-Term Action Anticipation [32.67768331823358]
Action anticipation involves predicting future actions having observed the initial portion of a video.
We introduce ANTICIPATR, which performs long-term action anticipation.
We propose a two-stage learning approach to train a novel transformer-based model.
arXiv Detail & Related papers (2022-10-20T20:07:30Z)
- Predicting the Next Action by Modeling the Abstract Goal [18.873728614415946]
We present an action anticipation model that leverages goal information for the purpose of reducing the uncertainty in future predictions.
We derive a novel concept called the abstract goal, which is conditioned on observed sequences of visual features for action anticipation.
Our method obtains impressive results on the very challenging Epic-Kitchens55 (EK55), EK100, and EGTEA Gaze+ datasets.
arXiv Detail & Related papers (2022-09-12T06:52:42Z)
- Intention-Conditioned Long-Term Human Egocentric Action Forecasting [14.347147051922175]
We deal with the Long-Term Action Anticipation task in egocentric videos.
By leveraging human intention as high-level information, we claim that our model is able to anticipate more time-consistent actions in the long term.
This work ranked first in both the CVPR@2022 and ECCV@2022 EGO4D LTA Challenges.
arXiv Detail & Related papers (2022-07-25T11:57:01Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
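A minimal sketch in the spirit of the few-shot subgoal planning entry above: a frozen language model is prompted with a handful of in-context (task, subgoal sequence) exemplars and asked to complete a new one, with no fine-tuning. The exemplars and the `llm_complete` callable are assumptions, not the paper's data or API.

```python
# Illustrative few-shot subgoal inference with a frozen language model.
from typing import Callable

FEW_SHOT_EXAMPLES = [
    ("make a cup of tea", "boil water -> add tea bag -> pour water -> steep -> serve"),
    ("wash the dishes", "clear plates -> rinse -> scrub with soap -> rinse -> dry"),
]

def infer_subgoals(task: str, llm_complete: Callable[[str], str]) -> str:
    # In-context examples followed by the new task; the model completes the subgoals.
    prompt = "\n".join(f"Task: {t}\nSubgoals: {s}" for t, s in FEW_SHOT_EXAMPLES)
    prompt += f"\nTask: {task}\nSubgoals:"
    return llm_complete(prompt)

# Usage with a dummy model stand-in:
print(infer_subgoals("make egg fried rice", lambda p: "crack eggs -> cook rice -> ..."))
```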
- Future Transformer for Long-term Action Anticipation [33.771374384674836]
We propose an end-to-end attention model for action anticipation, dubbed Future Transformer (FUTR).
Unlike previous autoregressive models, the proposed method learns to predict the whole sequence of future actions via parallel decoding.
We evaluate our method on two standard benchmarks for long-term action anticipation, Breakfast and 50 Salads, achieving state-of-the-art results.
arXiv Detail & Related papers (2022-05-27T14:47:43Z)
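A minimal PyTorch sketch of the parallel-decoding idea in the FUTR entry above: learnable future-query tokens attend to the observed features, and logits for all future steps come out of a single forward pass, with no autoregressive loop. The layer sizes and class counts are illustrative assumptions, not FUTR's actual configuration.

```python
# Parallel decoding of all future action steps at once (illustrative, not FUTR's code).
import torch
import torch.nn as nn

class ParallelAnticipationDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_future=8, n_classes=100):
        super().__init__()
        self.future_queries = nn.Embedding(n_future, d_model)  # one query per future step
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, observed_features: torch.Tensor) -> torch.Tensor:
        # observed_features: (batch, observed_len, d_model) from some visual encoder
        batch = observed_features.size(0)
        queries = self.future_queries.weight.unsqueeze(0).expand(batch, -1, -1)
        decoded = self.decoder(tgt=queries, memory=observed_features)
        return self.classifier(decoded)  # (batch, n_future, n_classes), all steps at once

# Example: 2 clips, 16 observed feature vectors each.
logits = ParallelAnticipationDecoder()(torch.randn(2, 16, 256))
print(logits.shape)  # torch.Size([2, 8, 100])
```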
- The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction [104.628661890361]
Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video.
We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales.
arXiv Detail & Related papers (2022-04-28T08:21:09Z)
- Learning to Anticipate Egocentric Actions by Imagination [60.21323541219304]
We study the egocentric action anticipation task, which predicts a future action seconds before it is performed in egocentric videos.
Our method significantly outperforms previous methods on both the seen test set and the unseen test set of the EPIC Kitchens Action Anticipation Challenge.
arXiv Detail & Related papers (2021-01-13T08:04:10Z)