Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025
- URL: http://arxiv.org/abs/2506.02550v2
- Date: Wed, 11 Jun 2025 11:16:41 GMT
- Title: Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025
- Authors: Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
- Abstract summary: We present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction.
- Score: 77.414837862995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted using a high-performance visual encoder. The features are then fed into a Transformer to predict verbs and nouns, with a verb-noun co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the predicted verb-noun pairs are formatted as textual prompts and input into a fine-tuned large language model (LLM) to anticipate future action sequences. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction. Our code will be released at https://github.com/CorrineQiu/Ego4D-LTA-Challenge-2025.
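To make the pipeline in the abstract concrete, the following is a minimal sketch, not the authors' released code (see the linked repository), of two of the steps it describes: biasing noun recognition with a verb-noun co-occurrence matrix, and serializing recognized verb-noun pairs into a textual prompt for the LLM. All function names, the log-linear fusion rule, the alpha weight, and the prompt wording are illustrative assumptions.

```python
# Illustrative sketch only: the exact fusion rule and prompt format used by the
# authors are not specified in the abstract; everything below is an assumption.
import numpy as np

def reweight_nouns(noun_logits: np.ndarray,
                   verb_id: int,
                   cooccurrence: np.ndarray,
                   alpha: float = 0.5) -> np.ndarray:
    """Bias noun scores toward nouns that co-occur with the predicted verb.

    noun_logits:  (num_nouns,) raw noun scores from the recognition head.
    cooccurrence: (num_verbs, num_nouns) counts estimated from training labels.
    alpha:        assumed interpolation weight (not given in the abstract).
    """
    # Turn the co-occurrence counts for this verb into a prior over nouns.
    prior = cooccurrence[verb_id] / (cooccurrence[verb_id].sum() + 1e-8)
    # Softmax the raw noun logits.
    noun_probs = np.exp(noun_logits - noun_logits.max())
    noun_probs /= noun_probs.sum()
    # Simple log-linear fusion of the recognition score and the prior.
    return (1 - alpha) * np.log(noun_probs + 1e-8) + alpha * np.log(prior + 1e-8)

def build_prompt(observed_actions: list[tuple[str, str]], num_future: int = 20) -> str:
    """Format recognized (verb, noun) pairs as a textual prompt for the LLM."""
    history = ", ".join(f"{v} {n}" for v, n in observed_actions)
    return (f"Observed actions: {history}. "
            f"Predict the next {num_future} actions as 'verb noun' pairs.")

# Toy usage with 3 verbs and 4 nouns.
cooc = np.array([[5, 0, 1, 0],
                 [0, 8, 0, 2],
                 [1, 1, 6, 0]], dtype=float)
noun_logits = np.array([0.2, 1.5, 0.1, 0.3])
scores = reweight_nouns(noun_logits, verb_id=1, cooccurrence=cooc)
print(int(scores.argmax()), build_prompt([("take", "knife"), ("cut", "onion")]))
```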
Related papers
- Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation [52.6091162517921]
INSIGHT is a two-stage framework for egocentric action anticipation.
In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions.
In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning.
arXiv Detail & Related papers (2025-08-03T12:52:27Z)
- Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction [41.63965006043724]
Visual Planning for Assistance (VPA) aims to predict a sequence of user actions required to achieve a specified goal based on a video showing the user's progress.
Recent advances in multimodal large language models (MLLMs) have shown promising results in video understanding.
We identify two challenges in training large MLLMs for video-based planning tasks.
arXiv Detail & Related papers (2025-07-20T21:39:05Z)
- VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning [59.68917139718813]
We show that a strong off-the-shelf frozen pretrained visual encoder can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning.
By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model is able to learn robust representations for forecasting.
arXiv Detail & Related papers (2024-10-04T14:52:09Z)
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z)
- AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? [28.912026171231528]
The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences.
We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal.
We propose a two-stage framework, AntGPT, which first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation.
arXiv Detail & Related papers (2023-07-31T02:14:19Z)
- Technical Report for Ego4D Long Term Action Anticipation Challenge 2023 [0.0]
We describe the technical details of our approach for the Ego4D Long-Term Action Anticipation Challenge 2023.
The aim of this task is to predict a sequence of future actions that will take place at an arbitrary time or later, given an input video.
Our method outperformed the baseline and was the second-place solution on the public leaderboard.
arXiv Detail & Related papers (2023-07-04T04:12:49Z)
- Palm: Predicting Actions through Language Models @ Ego4D Long-Term Action Anticipation Challenge 2023 [100.32802766127776]
Palm is a solution to the Long-Term Action Anticipation task utilizing vision-language and large language models.
It predicts future actions based on frame descriptions and action labels extracted from the input videos.
arXiv Detail & Related papers (2023-06-28T20:33:52Z)
- Egocentric Action Recognition by Video Attention and Temporal Context [83.57475598382146]
We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge.
In this challenge, action recognition is posed as the problem of simultaneously predicting a single 'verb' and 'noun' class label given a trimmed input video clip.
Our solution achieves strong performance on the challenge metrics without using object-specific reasoning or extra training data.
arXiv Detail & Related papers (2020-07-03T18:00:32Z)