StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation
- URL: http://arxiv.org/abs/2304.03959v2
- Date: Mon, 18 Mar 2024 16:40:17 GMT
- Title: StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation
- Authors: Francesco Ragusa, Giovanni Maria Farinella, Antonino Furnari
- Abstract summary: We study the short-term object interaction anticipation problem from the egocentric point of view.
Our approach simultaneously processes a still image and a video, detecting and localizing next-active objects.
Our method is ranked first in the public leaderboard of the EGO4D short term object interaction anticipation challenge 2022.
- Score: 14.188006024550257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The anticipation problem has been studied considering different aspects, such as predicting humans' locations, predicting hand and object trajectories, and forecasting actions and human-object interactions. In this paper, we study the short-term object interaction anticipation problem from the egocentric point of view, proposing a new end-to-end architecture named StillFast. Our approach simultaneously processes a still image and a video, detecting and localizing next-active objects, predicting the verb which describes the future interaction, and determining when the interaction will start. Experiments on the large-scale egocentric dataset EGO4D show that our method outperforms state-of-the-art approaches on the considered task. Our method is ranked first in the public leaderboard of the EGO4D short-term object interaction anticipation challenge 2022. Please see the project web page for code and additional details: https://iplab.dmi.unict.it/stillfast/.
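To make the task concrete: a short-term object interaction anticipation model outputs, for each anticipated interaction, a next-active-object box, a noun and verb, a time to contact, and a confidence score. The sketch below illustrates that output structure and a toy IoU check of the kind used when evaluating box predictions; all names are illustrative and not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class STAPrediction:
    """One short-term object interaction anticipation output.

    Field names are illustrative, not taken from the StillFast code.
    """
    box: tuple              # (x1, y1, x2, y2) next-active object in the still frame
    noun: str               # predicted object category
    verb: str               # predicted future interaction
    time_to_contact: float  # seconds until the interaction is predicted to start
    score: float            # confidence

# Hypothetical example: the model anticipates the camera wearer
# will "take" a "knife" in roughly 0.8 seconds.
pred = STAPrediction(box=(120, 80, 260, 190), noun="knife",
                     verb="take", time_to_contact=0.8, score=0.91)

def overlaps(p: STAPrediction, gt_box: tuple, thr: float = 0.5) -> bool:
    """Toy IoU check of a prediction against a ground-truth box."""
    ax1, ay1, ax2, ay2 = p.box
    bx1, by1, bx2, by2 = gt_box
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return union > 0 and inter / union >= thr
```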
Related papers
- Interacted Object Grounding in Spatio-Temporal Human-Object Interactions [70.8859442754261]
We introduce a new open-world benchmark: Grounding Interacted Objects (GIO).
An object grounding task is proposed expecting vision systems to discover interacted objects.
We propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos.
arXiv Detail & Related papers (2024-12-27T09:08:46Z) - FIction: 4D Future Interaction Prediction from Video [63.37136159797888]
We introduce 4D future interaction prediction from videos.
Given an input video of a human activity, the goal is to predict what objects at what 3D locations the person will interact with in the next time period.
We show that our proposed approach outperforms prior autoregressive and (lifted) 2D video models substantially.
arXiv Detail & Related papers (2024-12-01T18:44:17Z) - Short-term Object Interaction Anticipation with Disentangled Object Detection @ Ego4D Short Term Object Interaction Anticipation Challenge [11.429137967096935]
Short-term object interaction anticipation is an important task in egocentric video analysis.
Our proposed method, SOIA-DOD, effectively decomposes the task into 1) detecting the active object and 2) classifying the interaction and predicting its timing.
Our method first detects all potential active objects in the last frame of egocentric video by fine-tuning a pre-trained YOLOv9.
arXiv Detail & Related papers (2024-07-08T08:13:16Z) - Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of Short-Term object interaction Anticipation (STA).
We propose NAOGAT, a multi-modal end-to-end transformer network, to guide the model to predict context-aware future actions.
Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z) - Anticipating Next Active Objects for Egocentric Videos [29.473527958651317]
This paper addresses the problem of anticipating the next-active-object location in the future, for a given egocentric video clip.
We propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip.
arXiv Detail & Related papers (2023-02-13T13:44:52Z) - INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
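A core step in such a POMDP integration is the belief update: a prior distribution over candidate target objects is combined with the likelihood of a new observation (e.g. the user's answer to a clarifying question). This is a minimal Bayes-filter sketch of that step; the names and numbers are illustrative, not INVIGORATE's actual implementation.

```python
def belief_update(belief: dict, obs_likelihood: dict) -> dict:
    """One Bayes-filter step over candidate target objects.

    belief: prior probability per candidate object.
    obs_likelihood: likelihood of the latest observation given each candidate.
    Returns the normalized posterior.
    """
    posterior = {o: belief[o] * obs_likelihood.get(o, 0.0) for o in belief}
    z = sum(posterior.values())
    if z == 0:
        return belief  # uninformative observation; keep the prior
    return {o: p / z for o, p in posterior.items()}

# Hypothetical example: two candidates, and the user's answer strongly
# favors "cup" over "mug".
updated = belief_update({"cup": 0.5, "mug": 0.5}, {"cup": 0.9, "mug": 0.1})
```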
arXiv Detail & Related papers (2021-08-25T07:35:21Z) - Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks [29.671268927569063]
Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task.
This paper proposes a new method, which outperforms the previous methods by a large margin.
arXiv Detail & Related papers (2021-06-01T16:06:09Z) - Learning Long-term Visual Dynamics with Region Proposal Interaction Networks [75.06423516419862]
We build object representations that can capture inter-object and object-environment interactions over a long range.
Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin.
arXiv Detail & Related papers (2020-08-05T17:48:00Z) - Egocentric Action Recognition by Video Attention and Temporal Context [83.57475598382146]
We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge.
In this challenge, action recognition is posed as the problem of simultaneously predicting a single 'verb' and 'noun' class label given an input trimmed video clip.
Our solution achieves strong performance on the challenge metrics without using object-specific reasoning or extra training data.
arXiv Detail & Related papers (2020-07-03T18:00:32Z) - Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
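One simple way to picture an "interaction point" is as a point on the line between a human box and an object box, e.g. their centers' midpoint. The sketch below illustrates that geometric notion only; it is an assumption for illustration, not the paper's exact definition.

```python
def interaction_point(human_box: tuple, object_box: tuple) -> tuple:
    """Midpoint between the human and object box centers.

    Boxes are (x1, y1, x2, y2). Purely illustrative geometry.
    """
    hx = (human_box[0] + human_box[2]) / 2
    hy = (human_box[1] + human_box[3]) / 2
    ox = (object_box[0] + object_box[2]) / 2
    oy = (object_box[1] + object_box[3]) / 2
    return ((hx + ox) / 2, (hy + oy) / 2)
```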
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all information) and is not responsible for any consequences of its use.