SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric
Videos
- URL: http://arxiv.org/abs/2109.00829v1
- Date: Thu, 2 Sep 2021 10:20:18 GMT
- Title: SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric
Videos
- Authors: Nada Osman, Guglielmo Camporese, Pasquale Coscia, Lamberto Ballan
- Abstract summary: We build upon the RULSTM architecture, which is specifically designed for anticipating human actions.
We propose a novel attention-based technique to simultaneously evaluate slow and fast features extracted from three different modalities.
Two branches process information at different time scales, i.e., frame-rates, and several fusion schemes are considered to improve prediction accuracy.
- Score: 2.6572330982240935
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Action anticipation in egocentric videos is a difficult task due to the
inherently multi-modal nature of human actions. Additionally, some actions
happen faster or slower than others depending on the actor or the surrounding
context, which may vary each time and lead to different predictions. Based on
this idea, we build upon the RULSTM architecture, which is specifically designed
for anticipating human actions, and propose a novel attention-based technique
to simultaneously evaluate slow and fast features extracted from three
different modalities, namely RGB, optical flow, and extracted objects. Two
branches process information at different time scales, i.e., frame-rates, and
several fusion schemes are considered to improve prediction accuracy. We
perform extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ datasets, and
demonstrate that our technique systematically improves the results of the RULSTM
architecture in terms of Top-5 accuracy at different anticipation times.
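To make the fusion idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: each (time-scale, modality) branch yields a feature vector, a shared scorer converts the branches into softmax attention weights, and per-branch predictions are fused as a weighted sum. Feature size, branch count, and class count in the toy usage are assumptions.

```python
import torch
import torch.nn as nn

class SlowFastFusion(nn.Module):
    """Attention-weighted fusion of per-branch action predictions (a sketch)."""

    def __init__(self, feat_dim: int, num_branches: int, num_classes: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # shared scorer, one scalar per branch
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_branches)]
        )

    def forward(self, branch_feats):  # list of (B, feat_dim) tensors
        scores = torch.cat([self.score(f) for f in branch_feats], dim=1)
        weights = scores.softmax(dim=1)  # (B, num_branches) attention weights
        preds = torch.stack(
            [head(f) for head, f in zip(self.heads, branch_feats)], dim=1
        )  # (B, num_branches, num_classes)
        return (weights.unsqueeze(-1) * preds).sum(dim=1)  # fused logits

# Toy usage: 2 time scales x 3 modalities (RGB, flow, objects) = 6 branches;
# feature size and class count are arbitrary for the example.
feats = [torch.randn(4, 1024) for _ in range(6)]
fusion = SlowFastFusion(feat_dim=1024, num_branches=6, num_classes=2513)
print(fusion(feats).shape)  # torch.Size([4, 2513])
```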
Related papers
- A Time Series is Worth Five Experts: Heterogeneous Mixture of Experts for Traffic Flow Prediction [9.273632869779929]
We propose a Heterogeneous Mixture of Experts (TITAN) model for traffic flow prediction.
Experiments on two public traffic network datasets, METR-LA and PEMS-BAY, demonstrate that TITAN effectively captures variable-centric dependencies.
It improves on previous state-of-the-art (SOTA) models by approximately 4.37% to 11.53% across all evaluation metrics.
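For readers unfamiliar with the mechanism, a generic soft mixture-of-experts looks roughly like the sketch below; it is not TITAN's heterogeneous design, and all dimensions are made up. A gating network scores the experts per input, and the output is the gate-weighted sum of the expert outputs.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Generic soft mixture-of-experts (not TITAN's heterogeneous variant)."""

    def __init__(self, in_dim: int, out_dim: int, n_experts: int = 5):
        super().__init__()
        self.gate = nn.Linear(in_dim, n_experts)  # scores each expert per input
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
             for _ in range(n_experts)]
        )

    def forward(self, x):  # x: (B, in_dim)
        weights = self.gate(x).softmax(dim=-1)  # (B, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, out_dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)  # gate-weighted mixture
```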
arXiv Detail & Related papers (2024-09-26T00:26:47Z) - The Art of Imitation: Learning Long-Horizon Manipulation Tasks from Few Demonstrations [13.747258771184372]
There are several open challenges to applying Task-Parameterized Gaussian Mixture Models (TP-GMMs) in the wild.
We factorize the robot's end-effector velocity into its direction and magnitude.
We then segment and sequence skills from complex demonstration trajectories.
Our approach enables learning complex manipulation tasks from just five demonstrations.
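The direction/magnitude factorization mentioned above is simple enough to show directly; the helper below is an illustration of the idea, not the paper's code:

```python
import numpy as np

def factorize_velocity(vel: np.ndarray, eps: float = 1e-8):
    """Split a velocity vector into unit direction and scalar speed."""
    speed = np.linalg.norm(vel, axis=-1, keepdims=True)  # magnitude
    direction = vel / np.maximum(speed, eps)             # unit direction, safe at rest
    return direction, speed

d, s = factorize_velocity(np.array([0.3, 0.0, 0.4]))  # d = [0.6, 0.0, 0.8], s = [0.5]
```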
arXiv Detail & Related papers (2024-07-18T12:01:09Z) - SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
SAM-E leverages Segment Anything (SAM), a vision foundation model, for generalizable scene understanding and sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
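How a single pass can emit a whole action sequence is easiest to see as a multi-channel output head; the following is a hypothetical sketch of that idea, not the SAM-E code, with all shapes assumed:

```python
import torch
import torch.nn as nn

class MultiChannelHeatmapHead(nn.Module):
    """One conv emits horizon x targets heatmap channels in a single pass."""

    def __init__(self, in_ch: int, horizon: int, targets: int):
        super().__init__()
        self.head = nn.Conv2d(in_ch, horizon * targets, kernel_size=1)
        self.horizon, self.targets = horizon, targets

    def forward(self, feat):  # feat: (B, C, H, W) backbone features
        maps = self.head(feat)
        b, _, h, w = maps.shape
        # one heatmap per (future step, target): the full sequence at once
        return maps.view(b, self.horizon, self.targets, h, w)
```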
arXiv Detail & Related papers (2024-05-30T00:32:51Z) - Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We evaluate on three practical sports game datasets: Basketball-U, Football-U, and Soccer-U.
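The "masked inputs" setup can be illustrated generically: hide random (agent, timestep) coordinates and let the encoder learn to reconstruct them. The helper below illustrates masked trajectory modeling in general, not the GSM module itself:

```python
import torch

def mask_trajectories(traj: torch.Tensor, p: float = 0.3):
    """Zero out random (agent, timestep) positions; return data and mask.

    traj: (B, agents, T, 2) xy coordinates; p: masking probability.
    """
    mask = torch.rand(traj.shape[:-1], device=traj.device) < p  # (B, A, T)
    masked = traj.masked_fill(mask.unsqueeze(-1), 0.0)          # hide coordinates
    return masked, mask  # encoder is trained to recover traj where mask is True
```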
arXiv Detail & Related papers (2024-05-27T22:15:23Z) - SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video
Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
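A minimal version of "adding multi-head self-attention to a 3D convolutional backbone" might look like this sketch (illustrative, not the SSMTL++ code): spatio-temporal feature maps are flattened into tokens, attended over, and reshaped back.

```python
import torch
import torch.nn as nn

class ConvAttnBlock(nn.Module):
    """3D conv followed by multi-head self-attention over its tokens."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        x = torch.relu(self.conv(x))
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, T*H*W, C) token sequence
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).view(b, c, t, h, w)
```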
arXiv Detail & Related papers (2022-07-16T19:25:41Z) - Investigating Pose Representations and Motion Contexts Modeling for 3D
Motion Prediction [63.62263239934777]
We conduct an in-depth study of various pose representations, focusing on their effects on the motion prediction task.
We propose a novel RNN architecture termed AHMR (Attentive Hierarchical Motion Recurrent network) for motion prediction.
Our approach outperforms state-of-the-art methods in short-term prediction and achieves substantially better long-term prediction.
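As a concrete example of what "pose representation" means here, converting between two common ones, axis-angle and rotation matrix, takes a few lines via Rodrigues' formula; this is standard background, not the paper's method:

```python
import numpy as np

def axis_angle_to_matrix(a: np.ndarray) -> np.ndarray:
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(a)
    if theta < 1e-8:
        return np.eye(3)                  # near-zero rotation
    k = a / theta                         # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])      # cross-product (skew) matrix
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
```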
arXiv Detail & Related papers (2021-12-30T10:45:22Z) - Multi-Modal Temporal Convolutional Network for Anticipating Actions in
Egocentric Videos [22.90184887794109]
Methods that are accurate but not sufficiently fast would introduce a high latency into the decision process.
This poses a problem for domains such as autonomous driving, where the reaction time is crucial.
We propose a simple and effective multi-modal architecture based on temporal convolutions.
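The building block of such an architecture is a causal temporal convolution: padding only the past side keeps the model from peeking at future frames, which is what keeps latency low and the anticipation setting honest. A generic block, not the paper's exact design:

```python
import torch
import torch.nn as nn

class CausalTCNBlock(nn.Module):
    """Causal 1D convolution: output at time t depends only on inputs <= t."""

    def __init__(self, channels: int, kernel: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel - 1) * dilation  # how far the kernel reaches into the past
        self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation)

    def forward(self, x):  # x: (B, C, T)
        x = nn.functional.pad(x, (self.pad, 0))  # left-pad: past only, no future
        return torch.relu(self.conv(x))
```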
arXiv Detail & Related papers (2021-07-18T16:21:35Z) - SDMTL: Semi-Decoupled Multi-grained Trajectory Learning for 3D human
motion prediction [5.581663772616127]
We propose a novel end-to-end network, Semi-Decoupled Multi-grained Trajectory Learning network, to predict future human motion.
Specifically, we capture the temporal dynamics of motion trajectories at multiple granularities, both fine and coarse.
We learn multi-grained trajectory information hierarchically using BSMEs and further capture the temporal evolution directions at each granularity.
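The fine/coarse idea reduces to keeping the original sequence alongside temporally pooled views of it; a toy version, with the pooling factor an arbitrary choice:

```python
import torch

def multi_granularity(traj: torch.Tensor, factor: int = 4):
    """Return the fine-grained trajectory plus a pooled coarse view.

    traj: (B, T, D); assumes T is divisible by `factor`.
    """
    b, t, d = traj.shape
    coarse = traj.view(b, t // factor, factor, d).mean(dim=2)  # (B, T/factor, D)
    return traj, coarse  # both granularities feed the downstream model
```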
arXiv Detail & Related papers (2020-10-11T01:29:21Z) - Motion Prediction Using Temporal Inception Module [96.76721173517895]
We propose a Temporal Inception Module (TIM) to encode human motion.
Our framework produces input embeddings using convolutional layers, by using different kernel sizes for different input lengths.
Experimental results on the standard motion prediction benchmarks, Human3.6M and the CMU motion capture dataset, show that our approach consistently outperforms state-of-the-art methods.
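The kernel-size idea is inception-style: parallel temporal convolutions with different kernel widths see different temporal extents, and their outputs are concatenated. A sketch in that spirit, not the released TIM code:

```python
import torch
import torch.nn as nn

class TemporalInception(nn.Module):
    """Parallel 1D convs with different kernel sizes, concatenated."""

    def __init__(self, in_ch: int, out_ch: int, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_ch, out_ch, k, padding=k // 2) for k in kernels]
        )

    def forward(self, x):  # x: (B, C, T) pose/motion sequence
        # each branch covers a different temporal receptive field
        return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
```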
arXiv Detail & Related papers (2020-10-06T20:26:01Z) - Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video [27.391434284586985]
Rolling-Unrolling LSTM is a learning architecture to anticipate actions from egocentric videos.
The proposed approach is validated on EPIC-Kitchens, EGTEA Gaze+ and ActivityNet.
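Since this is the architecture the main paper extends, its two-phase mechanism is worth sketching: a "rolling" LSTM encodes the observed frames, then an "unrolling" LSTM starts from that state and steps forward across the anticipation gap. A condensed illustration, not the authors' release:

```python
import torch
import torch.nn as nn

class RollingUnrolling(nn.Module):
    """Rolling LSTM encodes the past; unrolling LSTM anticipates forward."""

    def __init__(self, feat_dim: int, hidden: int, num_classes: int):
        super().__init__()
        self.rolling = nn.LSTMCell(feat_dim, hidden)
        self.unrolling = nn.LSTMCell(feat_dim, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frames, unroll_steps: int):  # frames: (B, T, feat_dim)
        h = frames.new_zeros(frames.size(0), self.rolling.hidden_size)
        c = frames.new_zeros(frames.size(0), self.rolling.hidden_size)
        for t in range(frames.size(1)):        # "rolling": summarize observed video
            h, c = self.rolling(frames[:, t], (h, c))
        last = frames[:, -1]                   # simplest choice: repeat last feature
        for _ in range(unroll_steps):          # "unrolling": step into the future
            h, c = self.unrolling(last, (h, c))
        return self.classifier(h)              # anticipated action logits
```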
arXiv Detail & Related papers (2020-05-04T14:13:41Z) - Multi-Task Learning for Dense Prediction Tasks: A Survey [87.66280582034838]
Multi-task learning (MTL) techniques have shown promising results in terms of performance, computation, and/or memory footprint.
We provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision.
arXiv Detail & Related papers (2020-04-28T09:15:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.