Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video
- URL: http://arxiv.org/abs/2005.02190v2
- Date: Fri, 8 May 2020 13:56:58 GMT
- Title: Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video
- Authors: Antonino Furnari and Giovanni Maria Farinella
- Abstract summary: Rolling-Unrolling LSTM is a learning architecture to anticipate actions from egocentric videos.
The proposed approach is validated on EPIC-Kitchens, EGTEA Gaze+ and ActivityNet.
- Score: 27.391434284586985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we tackle the problem of egocentric action anticipation, i.e.,
predicting what actions the camera wearer will perform in the near future and
which objects they will interact with. Specifically, we contribute
Rolling-Unrolling LSTM, a learning architecture to anticipate actions from
egocentric videos. The method is based on three components: 1) an architecture
comprised of two LSTMs to model the sub-tasks of summarizing the past and
inferring the future, 2) a Sequence Completion Pre-Training technique which
encourages the LSTMs to focus on the different sub-tasks, and 3) a Modality
ATTention (MATT) mechanism to efficiently fuse multi-modal predictions
performed by processing RGB frames, optical flow fields and object-based
features. The proposed approach is validated on EPIC-Kitchens, EGTEA Gaze+ and
ActivityNet. The experiments show that the proposed architecture is
state-of-the-art in the domain of egocentric videos, achieving top performances
in the 2019 EPIC-Kitchens egocentric action anticipation challenge. The
approach also achieves competitive performance on ActivityNet with respect to
methods not based on unsupervised pre-training and generalizes to the tasks of
early action recognition and action recognition. To encourage research on this
challenging topic, we made our code, trained models, and pre-extracted features
available at our web page: http://iplab.dmi.unict.it/rulstm.
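The three components lend themselves to a compact sketch. The snippet below is a minimal, illustrative PyTorch re-implementation and not the released code linked above: feature and class sizes are dummies, MATT here computes its weights from the rolling hidden states only, the unrolling LSTM is fed the last observed feature at every step, and Sequence Completion Pre-Training is omitted.

```python
import torch
import torch.nn as nn


class RollingUnrollingBranch(nn.Module):
    """One modality branch: a 'rolling' LSTM summarizes the observed snippets,
    then an 'unrolling' LSTM, initialized with that state, is stepped towards
    the anticipated action."""

    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.rolling = nn.LSTMCell(feat_dim, hidden_dim)
        self.unrolling = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats, unroll_steps):
        # feats: (batch, time, feat_dim) pre-extracted snippet features
        b, t, _ = feats.shape
        h = feats.new_zeros(b, self.rolling.hidden_size)
        c = feats.new_zeros(b, self.rolling.hidden_size)
        for i in range(t):                        # roll over the observed past
            h, c = self.rolling(feats[:, i], (h, c))
        summary = h                               # past summary, reused by MATT
        hu, cu, last = h, c, feats[:, -1]
        for _ in range(unroll_steps):             # unroll towards the future
            hu, cu = self.unrolling(last, (hu, cu))
        return self.classifier(hu), summary


class MATT(nn.Module):
    """Modality ATTention: one weight per modality, computed from the
    concatenated past summaries, used to fuse per-modality predictions."""

    def __init__(self, hidden_dim, num_modalities):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_dim * num_modalities, 256),
            nn.ReLU(),
            nn.Linear(256, num_modalities),
        )

    def forward(self, logits_list, summaries_list):
        w = torch.softmax(self.score(torch.cat(summaries_list, dim=-1)), dim=-1)
        probs = [torch.softmax(l, dim=-1) for l in logits_list]
        return sum(w[:, i:i + 1] * p for i, p in enumerate(probs))


# Hypothetical usage with RGB, optical-flow and object branches (dummy sizes).
branches = nn.ModuleList(RollingUnrollingBranch(1024, 512, 125) for _ in range(3))
matt = MATT(512, 3)
inputs = [torch.randn(2, 8, 1024) for _ in range(3)]
outs = [branch(f, unroll_steps=4) for branch, f in zip(branches, inputs)]
fused = matt([o[0] for o in outs], [o[1] for o in outs])   # (2, 125) action scores
```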
Related papers
- Short-term Object Interaction Anticipation with Disentangled Object Detection @ Ego4D Short Term Object Interaction Anticipation Challenge [11.429137967096935]
Short-term object interaction anticipation is an important task in egocentric video analysis.
Our proposed method, SOIA-DOD, effectively decomposes it into 1) detecting the active object and 2) classifying the interaction and predicting its timing.
Our method first detects all potential active objects in the last frame of egocentric video by fine-tuning a pre-trained YOLOv9.
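As an illustration of that two-stage decomposition, a hedged sketch follows: the detector is abstracted as a callable (for example a fine-tuned YOLO model), and the crop encoder, head sizes and time-to-contact output are assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn


class InteractionHead(nn.Module):
    """Hypothetical second stage: given a feature vector for each detected
    candidate object, predict the interaction verb and a time to contact."""

    def __init__(self, feat_dim, num_verbs):
        super().__init__()
        self.verb = nn.Linear(feat_dim, num_verbs)
        self.ttc = nn.Linear(feat_dim, 1)   # anticipated time to contact (seconds)

    def forward(self, object_feats):
        # object_feats: (num_detections, feat_dim)
        return self.verb(object_feats), self.ttc(object_feats).squeeze(-1)


def anticipate(last_frame, detector, crop_encoder, head):
    """Stage 1: detect candidate active objects in the last observed frame.
    Stage 2: score each candidate's interaction class and timing."""
    boxes = detector(last_frame)              # e.g. a fine-tuned YOLO detector
    feats = torch.stack([crop_encoder(last_frame, box) for box in boxes])
    verb_logits, ttc = head(feats)
    return boxes, verb_logits, ttc
```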
arXiv Detail & Related papers (2024-07-08T08:13:16Z) - PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
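As a rough illustration only (not the paper's pipeline), recognized past actions and a scene description could be combined into a prompt for a language model as sketched below; the prompt format and the commented-out `llm_complete` call are placeholders.

```python
def build_anticipation_prompt(past_actions, scene_caption, horizon):
    """Assemble a prompt from recognized past actions and a caption of the
    environment, asking a language model to continue the action sequence."""
    history = ", ".join(past_actions)
    return (
        f"Observed actions so far: {history}. "
        f"Scene: {scene_caption}. "
        f"List the next {horizon} actions the person is likely to perform."
    )


# Hypothetical usage with stubbed recognition and captioning outputs.
prompt = build_anticipation_prompt(
    ["wash tomato", "cut tomato"], "a kitchen counter with a knife and a bowl", 5
)
# next_actions = llm_complete(prompt)   # placeholder for any language-model call
```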
arXiv Detail & Related papers (2023-11-29T02:17:27Z) - JOADAA: joint online action detection and action anticipation [2.7792814152937027]
Action anticipation involves forecasting future actions by connecting past events to future ones.
Online action detection is the task of predicting actions in a streaming manner.
By combining action anticipation and online action detection, our approach can account for the otherwise missing dependencies on future information.
arXiv Detail & Related papers (2023-09-12T11:17:25Z) - Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction [15.696593695918844]
This paper introduces a novel self-supervised video strategy for enhancing action prediction, inspired by DINO (self-distillation with no labels).
The experimental results showcase significant improvements in prediction performance across 3D-ResNet, Transformer, and LSTM architectures.
These findings highlight the potential of our approach in diverse video-based tasks such as activity recognition, motion planning, and scene understanding.
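A generic self-distillation sketch in the DINO spirit is shown below, under the assumption that the student embeds only the observed clip while an EMA teacher embeds a longer clip that includes future frames; the temperatures, the view split and the omitted centering step are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def dino_style_loss(student, teacher, past_clip, full_clip, temp_s=0.1, temp_t=0.04):
    """Train the student (past only) to match the sharpened output
    distribution of the teacher (past plus future), without labels."""
    with torch.no_grad():
        t_out = F.softmax(teacher(full_clip) / temp_t, dim=-1)
    s_out = F.log_softmax(student(past_clip) / temp_s, dim=-1)
    return -(t_out * s_out).sum(dim=-1).mean()


@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """Teacher weights are an exponential moving average of the student's."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)
```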
arXiv Detail & Related papers (2023-08-08T21:18:23Z) - Anticipating Next Active Objects for Egocentric Videos [29.473527958651317]
This paper addresses the problem of anticipating the next-active-object location in the future, for a given egocentric video clip.
We propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip.
arXiv Detail & Related papers (2023-02-13T13:44:52Z) - Silver-Bullet-3D at ManiSkill 2021: Learning-from-Demonstrations and Heuristic Rule-based Methods for Object Manipulation [118.27432851053335]
This paper presents an overview and comparative analysis of our systems designed for the two tracks of the SAPIEN ManiSkill Challenge 2021, one of which is the No Interaction track.
The No Interaction track targets learning policies from pre-collected demonstration trajectories.
In this track, we design a Heuristic Rule-based Method (HRM) to trigger high-quality object manipulation by decomposing the task into a series of sub-tasks.
For each sub-task, simple rule-based control strategies are adopted to predict actions that can be applied to the robotic arms.
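An illustrative controller skeleton in the spirit of that sub-task decomposition is given below; the reach/grasp/place phases, thresholds and four-dimensional action layout are invented for the example and are not taken from the paper.

```python
import numpy as np


def move_towards(src, dst, speed=0.05):
    """Displacement from src towards dst, clipped to a maximum step length."""
    delta = np.asarray(dst, dtype=float) - np.asarray(src, dtype=float)
    norm = np.linalg.norm(delta)
    return delta if norm < speed else delta / norm * speed


def rule_based_controller(obs, phase):
    """Map an observation to an arm action with hand-written rules, one
    rule per sub-task, advancing the phase when a simple condition holds."""
    gripper, obj, goal = obs["gripper_pos"], obs["object_pos"], obs["goal_pos"]
    if phase == "reach":
        action = np.append(move_towards(gripper, obj), 1.0)    # keep gripper open
        if np.linalg.norm(np.asarray(obj) - np.asarray(gripper)) < 0.02:
            phase = "grasp"
    elif phase == "grasp":
        action = np.array([0.0, 0.0, 0.0, -1.0])               # close gripper
        if obs.get("grasped", False):
            phase = "place"
    else:  # "place": carry the object towards the goal while holding it
        action = np.append(move_towards(gripper, goal), -1.0)
    return action, phase
```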
arXiv Detail & Related papers (2022-06-13T16:20:42Z) - Learning to Anticipate Future with Dynamic Context Removal [47.478225043001665]
Anticipating future events is an essential feature for intelligent systems and embodied AI.
We propose a novel training scheme called Dynamic Context Removal (DCR), which dynamically schedules the visibility of observed future in the learning procedure.
Our learning scheme is plug-and-play and easy to integrate with any reasoning model, including Transformers and LSTMs, with advantages in both effectiveness and efficiency.
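A small sketch of a curriculum-style visibility mask in that spirit follows; the linear schedule and Bernoulli sampling are assumptions for illustration, not the paper's actual scheduler.

```python
import torch


def future_visibility_mask(num_future, progress, batch=1):
    """Early in training most future frames stay visible; the visible
    fraction shrinks linearly with training progress (in [0, 1]) until
    only the observed past remains, as at test time."""
    keep_prob = max(0.0, 1.0 - progress)
    return torch.rand(batch, num_future) < keep_prob


# Example: at 80% of training, most future frames are already hidden.
mask = future_visibility_mask(num_future=8, progress=0.8, batch=2)
# masked_future = future_feats * mask.unsqueeze(-1)   # hypothetical use
```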
arXiv Detail & Related papers (2022-04-06T05:24:28Z) - Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both final performances and sample-efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representations consecutively so that they emphasize the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z) - Learning to Anticipate Egocentric Actions by Imagination [60.21323541219304]
We study the egocentric action anticipation task, which aims to predict a future action seconds before it is performed in egocentric videos.
Our method significantly outperforms previous methods on both the seen test set and the unseen test set of the EPIC Kitchens Action Anticipation Challenge.
arXiv Detail & Related papers (2021-01-13T08:04:10Z)