Palm: Predicting Actions through Language Models @ Ego4D Long-Term
Action Anticipation Challenge 2023
- URL: http://arxiv.org/abs/2306.16545v1
- Date: Wed, 28 Jun 2023 20:33:52 GMT
- Title: Palm: Predicting Actions through Language Models @ Ego4D Long-Term
Action Anticipation Challenge 2023
- Authors: Daoji Huang, Otmar Hilliges, Luc Van Gool, Xi Wang
- Abstract summary: Palm is a solution to the Long-Term Action Anticipation task utilizing vision-language and large language models.
It predicts future actions based on frame descriptions and action labels extracted from the input videos.
- Score: 100.32802766127776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Palm, a solution to the Long-Term Action Anticipation (LTA) task
utilizing vision-language and large language models. Given an input video with
annotated action periods, the LTA task aims to predict possible future actions.
We hypothesize that an optimal solution should capture the interdependency
between past and future actions, and be able to infer future actions based on
the structure and dependency encoded in the past actions. Large language models
have demonstrated remarkable commonsense-based reasoning ability. Inspired by
that, Palm chains an image captioning model and a large language model. It
predicts future actions based on frame descriptions and action labels extracted
from the input videos. Our method outperforms other participants in the EGO4D
LTA challenge and achieves the best performance in terms of action prediction.
Our code is available at https://github.com/DanDoge/Palm
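The abstract outlines Palm's pipeline: observed frames are captioned, the captions and the annotated past action labels are assembled into a prompt, and a large language model completes the sequence of future actions. The sketch below illustrates that chaining under stated assumptions; `caption_frame` and `llm_complete` are hypothetical stand-ins for a captioning model and an LLM client, not the API of the released code.

```python
# Minimal sketch of a Palm-style LTA pipeline (hypothetical helpers, not the
# released implementation; see https://github.com/DanDoge/Palm for the actual code).
from typing import Callable, List, Tuple


def build_prompt(captions: List[str],
                 past_actions: List[Tuple[str, str]],
                 num_future: int = 20) -> str:
    """Assemble frame descriptions and past (verb, noun) labels into one prompt."""
    lines = ["Observed frame descriptions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Annotated past actions (verb noun):")
    lines += [f"- {v} {n}" for v, n in past_actions]
    lines.append(f"Predict the next {num_future} actions, one 'verb noun' pair per line.")
    return "\n".join(lines)


def parse_actions(completion: str) -> List[Tuple[str, str]]:
    """Turn the LLM completion back into (verb, noun) pairs."""
    pairs = []
    for line in completion.strip().splitlines():
        tokens = line.lstrip("-. 0123456789").split()
        if len(tokens) >= 2:
            pairs.append((tokens[0], tokens[1]))
    return pairs


def predict_future_actions(frames,
                           past_actions: List[Tuple[str, str]],
                           caption_frame: Callable,
                           llm_complete: Callable[[str], str],
                           num_future: int = 20) -> List[Tuple[str, str]]:
    """Chain the captioning model and the language model, as the abstract describes."""
    captions = [caption_frame(f) for f in frames]
    prompt = build_prompt(captions, past_actions, num_future)
    return parse_actions(llm_complete(prompt))[:num_future]
```

Any off-the-shelf captioner and instruction-following LLM could fill the two callables; the key idea, per the abstract, is that the prompt encodes the structure and dependency of past actions so the language model's commonsense reasoning can extrapolate the future ones.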
Related papers
- Human Motion Instruction Tuning [30.71209562108675]
This paper presents LLaMo, a framework for human motion instruction tuning.
LLaMo retains motion in its native form for instruction tuning.
By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis.
arXiv Detail & Related papers (2024-11-25T14:38:43Z) - PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z) - Vamos: Versatile Action Models for Video Understanding [23.631145570126268]
We propose versatile action models (Vamos), a learning framework powered by a large language model acting as the "reasoner".
We evaluate Vamos on five benchmarks (Ego4D, NeXT-QA, IntentQA, Spacewalk-18, and EgoSchema) for its capability to model temporal dynamics, encode visual history, and perform reasoning.
arXiv Detail & Related papers (2023-11-22T17:44:24Z) - AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? [28.912026171231528]
The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences.
We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal.
We propose a two-stage framework, AntGPT, which first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation.
arXiv Detail & Related papers (2023-07-31T02:14:19Z) - Summarize the Past to Predict the Future: Natural Language Descriptions
of Context Boost Multimodal Object Interaction Anticipation [72.74191015833397]
We propose TransFusion, a multimodal transformer-based architecture.
It exploits the representational power of language by summarizing the action context.
Our model enables more efficient end-to-end learning.
arXiv Detail & Related papers (2023-01-22T21:30:12Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z) - What is More Likely to Happen Next? Video-and-Language Future Event
Prediction [111.93601253692165]
Given a video with aligned dialogue, people can often infer what is more likely to happen next.
In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions.
We collect a new dataset, named Video-and-Language Event Prediction, with 28,726 future event prediction examples.
arXiv Detail & Related papers (2020-10-15T19:56:47Z)