Intention-Conditioned Long-Term Human Egocentric Action Forecasting
- URL: http://arxiv.org/abs/2207.12080v4
- Date: Mon, 8 Apr 2024 15:50:13 GMT
- Title: Intention-Conditioned Long-Term Human Egocentric Action Forecasting
- Authors: Esteve Valls Mascaro, Hyemin Ahn, Dongheui Lee
- Abstract summary: We deal with the Long-Term Action Anticipation task in egocentric videos.
By leveraging human intention as high-level information, we claim that our model is able to anticipate more time-consistent actions in the long term.
This work ranked first in both the CVPR@2022 and ECCV@2022 EGO4D LTA Challenges.
- Score: 14.347147051922175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To anticipate how a human will act in the future, it is essential to understand the human intention, since it guides the human toward a certain goal. In this paper, we propose a hierarchical architecture which assumes that a sequence of human actions (low-level) can be derived from the human intention (high-level). Based on this, we address the Long-Term Action Anticipation task in egocentric videos. Our framework first extracts two levels of human information, the observed actions and the underlying intention, from the N observed video clips through a Hierarchical Multi-task MLP Mixer (H3M). Then, we model the uncertainty of the future through an Intention-Conditioned Variational Auto-Encoder (I-CVAE) that generates K stable predictions of the next Z=20 actions that the observed human might perform. By leveraging human intention as high-level information, we claim that our model is able to anticipate more time-consistent actions in the long term, thus improving the results over baseline methods in the EGO4D Challenge. This work ranked first in both the CVPR@2022 and ECCV@2022 EGO4D LTA Challenges by providing more plausible anticipated sequences and improving the anticipation of nouns and overall actions. Webpage: https://evm7.github.io/icvae-page/
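The two-stage flow described in the abstract can be pictured in code. Below is a minimal, hypothetical sketch of how H3M-style features and intention logits could feed an intention-conditioned VAE that samples K candidate futures; all module internals, dimensions, and vocabulary sizes are illustrative assumptions, not the authors' released implementation (see the linked webpage for the official code).

```python
# Hypothetical sketch of the two-stage pipeline (not the authors' code).
# N observed clips -> K candidate sequences of Z = 20 future actions.
import torch
import torch.nn as nn

N, Z, K = 8, 20, 5                   # observed clips, horizon, candidates
D, LATENT = 256, 64                  # feature and latent sizes (assumed)
NUM_ACTIONS, NUM_INTENTS = 478, 20   # hypothetical vocabulary sizes

class H3M(nn.Module):
    """Stand-in for the Hierarchical Multi-task MLP Mixer: extracts
    low-level action features and one high-level intention from N clips."""
    def __init__(self):
        super().__init__()
        self.mixer = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
        self.action_head = nn.Linear(D, NUM_ACTIONS)     # per-clip actions
        self.intent_head = nn.Linear(D, NUM_INTENTS)     # one intention

    def forward(self, clips):                  # clips: (B, N, D)
        h = self.mixer(clips)
        action_logits = self.action_head(h)    # (B, N, NUM_ACTIONS)
        intent_logits = self.intent_head(h.mean(dim=1))  # (B, NUM_INTENTS)
        return h, action_logits, intent_logits

class ICVAE(nn.Module):
    """Stand-in for the Intention-Conditioned VAE: decodes Z future actions
    from a latent sample conditioned on the inferred intention."""
    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(LATENT + NUM_INTENTS, D), nn.GELU(),
            nn.Linear(D, Z * NUM_ACTIONS))

    def sample(self, intent_logits, k=K):      # intent_logits: (B, NUM_INTENTS)
        B = intent_logits.size(0)
        cond = intent_logits.softmax(-1)
        futures = []
        for _ in range(k):                     # K candidate futures
            z = torch.randn(B, LATENT)         # sample from the prior
            logits = self.decoder(torch.cat([z, cond], dim=-1))
            futures.append(logits.view(B, Z, NUM_ACTIONS).argmax(-1))
        return torch.stack(futures, dim=1)     # (B, K, Z) action indices

h3m, icvae = H3M(), ICVAE()
clips = torch.randn(2, N, D)                   # pre-extracted clip features
_, _, intent = h3m(clips)
print(icvae.sample(intent).shape)              # torch.Size([2, 5, 20])
```

Conditioning the decoder on a single intention vector, rather than only on the per-clip actions, is what encourages the K sampled futures to stay consistent over the long horizon.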
Related papers
- An Epistemic Human-Aware Task Planner which Anticipates Human Beliefs and Decisions [8.309981857034902]
The aim is to build a robot policy that accounts for uncontrollable human behaviors.
We propose a novel planning framework and build a solver based on AND-OR search.
Preliminary experiments in two domains, one novel and one adapted, demonstrate the effectiveness of the framework.
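For context, an AND-OR search solver of the kind mentioned above alternates controllable robot choices (OR nodes: pick one action) with uncontrollable human reactions (AND nodes: every reaction must still reach the goal). The following is a generic sketch of that skeleton, not the paper's epistemic solver:

```python
# Generic AND-OR search for contingent planning (illustrative only; the
# paper's solver adds epistemic/belief modeling on top of this skeleton).
def and_or_search(state, goal, robot_actions, human_reactions, depth=10):
    """Return a policy dict {state: robot_action} or None if unsolvable.
    States must be hashable, since the policy maps each state to an action."""
    if goal(state):
        return {}
    if depth == 0:                             # bound to avoid infinite regress
        return None
    for action in robot_actions(state):        # OR node: robot picks one action
        policy = {state: action}
        solved = True
        for nxt in human_reactions(state, action):  # AND node: every possible
            sub = and_or_search(nxt, goal, robot_actions,   # human reaction
                                human_reactions, depth - 1)  # must succeed
            if sub is None:
                solved = False
                break
            policy.update(sub)
        if solved:
            return policy
    return None
```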
arXiv Detail & Related papers (2024-09-27T08:27:36Z)
- CoNav: A Benchmark for Human-Centered Collaborative Navigation [66.6268966718022]
We propose a collaborative navigation (CoNav) benchmark.
Our CoNav tackles the critical challenge of constructing a 3D navigation environment with realistic and diverse human activities.
We propose an intention-aware agent for reasoning both long-term and short-term human intention.
arXiv Detail & Related papers (2024-06-04T15:44:25Z)
- Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
- Exploration with Principles for Diverse AI Supervision [88.61687950039662]
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI.
While this generative AI approach has produced impressive results, it heavily leans on human supervision.
This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation.
We propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data.
arXiv Detail & Related papers (2023-10-13T07:03:39Z)
- Staged Contact-Aware Global Human Motion Forecasting [7.930326095134298]
Scene-aware global human motion forecasting is critical for manifold applications, including virtual reality, robotics, and sports.
We propose STAG, a novel three-stage pipeline for STAGed contact-aware global human motion forecasting in a 3D environment.
STAG achieves a 1.8% and 16.2% overall improvement in pose and trajectory prediction, respectively, on the scene-aware GTA-IM dataset.
arXiv Detail & Related papers (2023-09-16T10:47:48Z)
- AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? [28.912026171231528]
The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences.
We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal.
We propose a two-stage framework, AntGPT, which first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation.
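A recognize-then-prompt pipeline of this kind can be sketched as follows; the prompt wording and the `llm_complete` callable are placeholders for illustration, not AntGPT's actual interface.

```python
# Sketch of a recognize-then-prompt LTA pipeline in the spirit of AntGPT
# (prompt format and `llm_complete` are placeholders, not the paper's code).
def forecast_actions(recognized, llm_complete, z=20):
    """recognized: list of (verb, noun) pairs from the observed video."""
    history = ", ".join(f"{v} {n}" for v, n in recognized)
    prompt = (
        f"Observed actions so far: {history}.\n"
        f"Infer the actor's goal, then list the next {z} actions as "
        f"'verb noun' pairs, one per line."
    )
    reply = llm_complete(prompt)               # any text-completion backend
    future = [tuple(line.split(maxsplit=1))    # parse 'verb noun' lines
              for line in reply.strip().splitlines() if line.strip()]
    return future[:z]

# Usage with a stub LLM standing in for a real model:
stub = lambda p: "take knife\ncut onion\n" * 10
print(forecast_actions([("open", "fridge"), ("take", "onion")], stub)[:3])
```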
arXiv Detail & Related papers (2023-07-31T02:14:19Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show that they fall significantly short of humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and egocentric views with eye gaze.
Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects.
To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
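Bidirectional communication between two branches is often realized with symmetric cross-attention, where each branch attends to the other. The sketch below shows that general pattern under assumed dimensions; it is not GIMO's exact architecture.

```python
# Illustrative bidirectional cross-attention between gaze and motion
# branches (the general idea, not GIMO's exact architecture).
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.gaze_to_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.motion_to_gaze = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, gaze, motion):           # (B, Tg, D), (B, Tm, D)
        # Motion queries attend to gaze; gaze queries attend to motion.
        m2, _ = self.gaze_to_motion(motion, gaze, gaze)
        g2, _ = self.motion_to_gaze(gaze, motion, motion)
        return gaze + g2, motion + m2          # residual updates, both branches

fusion = BidirectionalFusion()
g, m = torch.randn(2, 30, 128), torch.randn(2, 60, 128)
g_out, m_out = fusion(g, m)
print(g_out.shape, m_out.shape)                # (2, 30, 128) and (2, 60, 128)
```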
arXiv Detail & Related papers (2022-04-20T13:17:39Z)
- Generating Active Explicable Plans in Human-Robot Teaming [4.657875410615595]
It is important for robots to behave explicably by meeting the human's expectations.
Existing approaches to generating explicable plans often assume that the human's expectations are known and static.
We apply a Bayesian approach to model and predict dynamic human belief and expectations to make explicable planning more anticipatory.
arXiv Detail & Related papers (2021-09-18T05:05:50Z)
- Probabilistic Human Motion Prediction via A Bayesian Neural Network [71.16277790708529]
We propose a probabilistic model for human motion prediction in this paper.
Given an observed motion sequence, our model can generate several future motions.
We extensively validate our approach on the large-scale benchmark dataset Human3.6M.
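One standard way an (approximately) Bayesian network produces several futures from one observation is Monte Carlo dropout at test time; the following is a generic sketch of that recipe, not necessarily the paper's posterior approximation.

```python
# Monte Carlo dropout as a cheap approximate-Bayesian sampler: keeping
# dropout active at test time yields a different future per forward pass
# (a standard stand-in, not necessarily the paper's exact model).
import torch
import torch.nn as nn

class MotionPredictor(nn.Module):
    def __init__(self, joints=32 * 3, hidden=256, horizon=25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joints, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, horizon * joints))
        self.horizon, self.joints = horizon, joints

    def sample_futures(self, last_pose, k=10):
        self.train()                           # keep dropout stochastic
        with torch.no_grad():
            outs = [self.net(last_pose) for _ in range(k)]
        return torch.stack(outs, 1).view(-1, k, self.horizon, self.joints)

model = MotionPredictor()
pose = torch.randn(4, 96)                      # batch of observed poses
print(model.sample_futures(pose).shape)        # torch.Size([4, 10, 25, 96])
```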
arXiv Detail & Related papers (2021-07-14T09:05:33Z)
- 3D Human motion anticipation and classification [8.069283749930594]
We propose a novel sequence-to-sequence model for human motion prediction and feature learning.
Our model learns to predict multiple future sequences of human poses from the same input sequence.
We show that training an activity recognition network takes less than half the number of epochs when using the features learned by the discriminator.
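Reusing discriminator features typically means freezing the trained discriminator as a backbone and training only a small classification head on top. A generic transfer sketch under assumed shapes, not the paper's code:

```python
# Reusing a pretrained discriminator as a frozen feature extractor for
# activity recognition (the general transfer recipe, not the paper's code).
import torch
import torch.nn as nn

disc_backbone = nn.GRU(96, 128, batch_first=True)   # assume trained in the GAN
for p in disc_backbone.parameters():
    p.requires_grad = False                          # freeze learned features

classifier = nn.Linear(128, 15)                      # 15 hypothetical activities
poses = torch.randn(8, 50, 96)                       # (batch, frames, joints*3)
_, h = disc_backbone(poses)                          # h: (1, batch, 128)
logits = classifier(h.squeeze(0))                    # only this head is trained
print(logits.shape)                                  # torch.Size([8, 15])
```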
arXiv Detail & Related papers (2020-12-31T00:19:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.