StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation
- URL: http://arxiv.org/abs/2304.03959v2
- Date: Mon, 18 Mar 2024 16:40:17 GMT
- Title: StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation
- Authors: Francesco Ragusa, Giovanni Maria Farinella, Antonino Furnari
- Abstract summary: We study the short-term object interaction anticipation problem from the egocentric point of view.
Our approach simultaneously processes a still image and a video, detecting and localizing next-active objects.
Our method ranks first on the public leaderboard of the EGO4D Short-Term Object Interaction Anticipation challenge 2022.
- Score: 14.188006024550257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The anticipation problem has been studied considering different aspects, such as predicting people's locations, predicting hand and object trajectories, and forecasting actions and human-object interactions. In this paper, we study the short-term object interaction anticipation problem from the egocentric point of view, proposing a new end-to-end architecture named StillFast. Our approach simultaneously processes a still image and a video, detecting and localizing next-active objects, predicting the verb which describes the future interaction, and determining when the interaction will start. Experiments on the large-scale egocentric dataset EGO4D show that our method outperforms state-of-the-art approaches on the considered task. Our method ranks first on the public leaderboard of the EGO4D Short-Term Object Interaction Anticipation challenge 2022. Please see the project web page for code and additional details: https://iplab.dmi.unict.it/stillfast/.
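As a rough illustration of the two-branch design described in the abstract (a still-image stream and a video stream whose fused features drive next-active-object localization, verb prediction, and time-to-contact estimation), the following is a minimal PyTorch sketch. The backbone choices, feature dimensions, class counts, and single-box head are assumptions made for brevity, not the paper's actual implementation, which detects and localizes multiple candidate objects.

```python
# Minimal sketch of a StillFast-style two-branch model. Backbones, feature sizes,
# and the single-box toy heads are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.video import r3d_18

class TwoBranchSTA(nn.Module):
    def __init__(self, num_nouns=128, num_verbs=64):   # placeholder class counts
        super().__init__()
        # "Still" branch: 2D CNN over the last observed frame.
        self.still = resnet50(weights=None)
        self.still.fc = nn.Identity()                   # 2048-d global feature
        # "Fast" branch: 3D CNN over the preceding video clip.
        self.fast = r3d_18(weights=None)
        self.fast.fc = nn.Identity()                    # 512-d global feature
        fused = 2048 + 512
        # Toy prediction heads: one box, noun and verb logits, time to contact.
        self.box_head = nn.Linear(fused, 4)             # (x1, y1, x2, y2), normalized
        self.noun_head = nn.Linear(fused, num_nouns)
        self.verb_head = nn.Linear(fused, num_verbs)
        self.ttc_head = nn.Linear(fused, 1)             # seconds until the interaction starts

    def forward(self, still_img, clip):
        # still_img: (B, 3, H, W); clip: (B, 3, T, H', W')
        f2d = self.still(still_img)
        f3d = self.fast(clip)
        f = torch.cat([f2d, f3d], dim=1)
        return {
            "box": self.box_head(f).sigmoid(),
            "noun_logits": self.noun_head(f),
            "verb_logits": self.verb_head(f),
            "ttc": self.ttc_head(f).squeeze(-1),
        }

model = TwoBranchSTA()
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 16, 112, 112))
print(out["box"].shape, out["verb_logits"].shape, out["ttc"].shape)
```

In the sketch the two streams are fused only at the global-feature level to keep the code short; a real detector would fuse them at multiple scales and score many candidate boxes rather than regressing a single one.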
Related papers
- Short-term Object Interaction Anticipation with Disentangled Object Detection @ Ego4D Short Term Object Interaction Anticipation Challenge [11.429137967096935]
Short-term object interaction anticipation is an important task in egocentric video analysis.
Our proposed method, SOIA-DOD, effectively decomposes it into 1) detecting active objects and 2) classifying interactions and predicting their timing.
Our method first detects all potential active objects in the last frame of an egocentric video using a fine-tuned, pre-trained YOLOv9.
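As a rough illustration of this detect-then-classify decomposition, here is a hedged sketch: a YOLO detector proposes candidate active objects in the last frame, and a small head classifies a verb and regresses a time to contact per candidate. The checkpoint name, the ultralytics interface, the crop size, and the interaction head are assumptions for illustration, not the SOIA-DOD implementation.

```python
# Hedged sketch of a detect-then-classify pipeline: 1) detect candidate active
# objects in the last frame, 2) classify a verb and predict a time to contact per
# candidate. Checkpoint name, crop size, and the head are illustrative assumptions.
import torch
import torch.nn as nn
from ultralytics import YOLO  # assumes the ultralytics package with YOLOv9 checkpoints

class InteractionHead(nn.Module):
    """Toy head: verb logits + time to contact from a cropped object region."""
    def __init__(self, num_verbs=64):                  # placeholder verb count
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.verb = nn.Linear(64, num_verbs)
        self.ttc = nn.Linear(64, 1)

    def forward(self, crops):                          # crops: (N, 3, 128, 128)
        f = self.features(crops)
        return self.verb(f), self.ttc(f).squeeze(-1)

detector = YOLO("yolov9c.pt")                          # stage 1: candidate active objects
head = InteractionHead().eval()

def anticipate(last_frame):
    """last_frame: HxWx3 uint8 numpy image (the clip's final frame)."""
    result = detector(last_frame, verbose=False)[0]
    preds = []
    for x1, y1, x2, y2 in result.boxes.xyxy.cpu().int().tolist():
        crop = last_frame[y1:y2, x1:x2]
        crop = torch.from_numpy(crop).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        crop = nn.functional.interpolate(crop, size=(128, 128))
        verb_logits, ttc = head(crop)
        preds.append(((x1, y1, x2, y2), verb_logits.argmax(-1).item(), ttc.item()))
    return preds  # [(box, verb_id, seconds_to_contact), ...]
```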
arXiv Detail & Related papers (2024-07-08T08:13:16Z) - AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation [14.734158936250918]
Short-Term object-interaction Anticipation (STA) is fundamental for wearable assistants or human-robot interaction to understand user goals.
We improve the performance of STA predictions with two contributions.
First, we propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion; a rough sketch of the temporal-pooling idea follows below.
Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot.
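One way to picture "frame-guided temporal pooling" is to let still-image tokens query per-frame clip tokens through cross-attention, pooling the video into an image-aligned representation. The sketch below is a hedged interpretation with invented dimensions, not the STAformer module itself.

```python
# Hedged sketch of frame-guided temporal pooling: still-image tokens attend over
# clip tokens, pooling the video into an image-aligned representation. Dimensions
# and the single-layer design are illustrative assumptions.
import torch
import torch.nn as nn

class FrameGuidedTemporalPooling(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens, clip_tokens):
        # frame_tokens: (B, N, D) tokens from the still image
        # clip_tokens:  (B, T*N, D) tokens from the observed clip
        pooled, _ = self.attn(query=frame_tokens, key=clip_tokens, value=clip_tokens)
        # Residual connection keeps the still-image content and adds temporal context.
        return self.norm(frame_tokens + pooled)

pool = FrameGuidedTemporalPooling()
out = pool(torch.randn(2, 196, 256), torch.randn(2, 8 * 196, 256))
print(out.shape)  # torch.Size([2, 196, 256])
```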
arXiv Detail & Related papers (2024-06-03T10:57:18Z) - Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
Experimental results demonstrate that MPI achieves remarkable improvements of 10% to 64% over previous state-of-the-art methods on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z) - Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of Short-Term Object interaction anticipation (STA).
We propose NAOGAT, a multi-modal end-to-end transformer network, to guide the model to predict context-aware future actions.
Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z) - Anticipating Next Active Objects for Egocentric Videos [29.473527958651317]
This paper addresses the problem of anticipating the next-active-object location in the future, for a given egocentric video clip.
We propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip.
arXiv Detail & Related papers (2023-02-13T13:44:52Z) - Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction [63.62263239934777]
We conduct an in-depth study on various pose representations, with a focus on their effects on the motion prediction task.
We propose a novel RNN architecture termed AHMR (Attentive Hierarchical Motion Recurrent network) for motion prediction.
Our approach outperforms state-of-the-art methods in short-term prediction and achieves substantially better long-term prediction.
arXiv Detail & Related papers (2021-12-30T10:45:22Z) - INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
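A toy way to picture this integration: maintain a belief over which detected object the user means, update it with the (noisy) grounding-network scores, and choose between asking a clarifying question and grasping. The observation model and thresholds below are invented for illustration and are not the paper's actual POMDP.

```python
# Toy sketch of a belief-update loop: a belief over candidate objects is updated
# with noisy grounding scores, and the robot asks when uncertain or grasps when
# confident. Thresholds and the observation model are illustrative assumptions.
import numpy as np

def update_belief(belief, grounding_scores):
    """Bayes-style update: treat normalized grounding scores as a likelihood."""
    likelihood = np.asarray(grounding_scores, dtype=float)
    likelihood = likelihood / likelihood.sum()
    posterior = belief * likelihood
    return posterior / posterior.sum()

def choose_action(belief, grasp_threshold=0.8):
    """Grasp the most likely object if confident enough, otherwise ask about it."""
    target = int(np.argmax(belief))
    if belief[target] >= grasp_threshold:
        return ("grasp", target)
    return ("ask", target)   # e.g. "Do you mean this object?"

# Three detected candidates, uniform prior; two rounds of grounding observations.
belief = np.ones(3) / 3
for scores in ([0.2, 0.7, 0.1], [0.1, 0.85, 0.05]):
    belief = update_belief(belief, scores)
    print(belief.round(3), choose_action(belief))
```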
arXiv Detail & Related papers (2021-08-25T07:35:21Z) - Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks [29.671268927569063]
Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task.
This paper proposes a new method, which outperforms the previous methods by a large margin.
arXiv Detail & Related papers (2021-06-01T16:06:09Z) - Learning Long-term Visual Dynamics with Region Proposal Interaction Networks [75.06423516419862]
We build object representations that can capture inter-object and object-environment interactions over a long range.
Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin.
arXiv Detail & Related papers (2020-08-05T17:48:00Z) - Egocentric Action Recognition by Video Attention and Temporal Context [83.57475598382146]
We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge.
In this challenge, action recognition is posed as the problem of simultaneously predicting a single 'verb' and 'noun' class label given an input trimmed video clip; a minimal sketch of this formulation follows below.
Our solution achieves strong performance on the challenge metrics without using object-specific reasoning or extra training data.
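The verb-and-noun formulation reduces to one clip encoder with two classification heads trained jointly. Below is a minimal sketch with a stand-in 3D backbone and placeholder class counts, not the challenge submission's actual model.

```python
# Minimal sketch of the verb/noun formulation: one clip encoder, two softmax heads
# trained jointly. The 3D backbone and class counts are placeholders.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class VerbNounClassifier(nn.Module):
    def __init__(self, num_verbs=125, num_nouns=352):   # placeholder class counts
        super().__init__()
        self.backbone = r3d_18(weights=None)
        self.backbone.fc = nn.Identity()                 # 512-d clip feature
        self.verb_head = nn.Linear(512, num_verbs)
        self.noun_head = nn.Linear(512, num_nouns)

    def forward(self, clip):                             # clip: (B, 3, T, H, W)
        f = self.backbone(clip)
        return self.verb_head(f), self.noun_head(f)

model = VerbNounClassifier()
verb_logits, noun_logits = model(torch.randn(2, 3, 16, 112, 112))
# Joint training simply sums the two cross-entropy losses.
loss = nn.functional.cross_entropy(verb_logits, torch.tensor([0, 1])) \
     + nn.functional.cross_entropy(noun_logits, torch.tensor([2, 3]))
print(verb_logits.shape, noun_logits.shape, loss.item())
```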
arXiv Detail & Related papers (2020-07-03T18:00:32Z) - Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction; a rough sketch of this idea follows below.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
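A hedged sketch of the interaction-point idea: a fully-convolutional head predicts per-class point heatmaps, and local maxima give the location and class of candidate interactions; the pairing with detected humans and objects is omitted. Channel sizes, the class count, and the peak-picking rule are assumptions.

```python
# Hedged sketch of an interaction-point head: per-class point heatmaps, with peaks
# taken as interaction points. Channel sizes and class count are placeholders.
import torch
import torch.nn as nn

class InteractionPointHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=29):   # placeholder class count
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, feature_map):                         # (B, C, H, W)
        return self.net(feature_map).sigmoid()              # per-class point heatmaps

def extract_points(heatmaps, threshold=0.5):
    """Keep local maxima above a score threshold as interaction points."""
    pooled = nn.functional.max_pool2d(heatmaps, 3, stride=1, padding=1)
    peaks = (heatmaps == pooled) & (heatmaps > threshold)
    return peaks.nonzero()                                  # rows: (batch, class, y, x)

head = InteractionPointHead()
heat = head(torch.randn(1, 256, 64, 64))
print(heat.shape, extract_points(heat).shape)
```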
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.