Enhancing Next Active Object-based Egocentric Action Anticipation with Guided Attention
- URL: http://arxiv.org/abs/2305.12953v2
- Date: Fri, 23 Jun 2023 15:34:13 GMT
- Title: Enhancing Next Active Object-based Egocentric Action Anticipation with Guided Attention
- Authors: Sanket Thakur, Cigdem Beyan, Pietro Morerio, Vittorio Murino, Alessio Del Bue
- Abstract summary: Short-term action anticipation (STA) in first-person videos is a challenging task.
We propose a novel approach that applies a guided attention mechanism between objects and the spatiotemporal features extracted from video clips.
Our method, GANO, is a multi-modal, end-to-end, single transformer-based network.
- Score: 45.60789439017625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Short-term action anticipation (STA) in first-person videos is a challenging
task that involves understanding the next active object interactions and
predicting future actions. Existing action anticipation methods have primarily
focused on utilizing features extracted from video clips, but often overlooked
the importance of objects and their interactions. To this end, we propose a
novel approach that applies a guided attention mechanism between the objects
and the spatiotemporal features extracted from video clips, enhancing the
motion and contextual information, and further decoding the object-centric and
motion-centric information to address the problem of STA in egocentric videos.
Our method, GANO (Guided Attention for Next active Objects), is a multi-modal,
end-to-end, single transformer-based network. The experimental results
performed on the largest egocentric dataset demonstrate that GANO outperforms
the existing state-of-the-art methods for the prediction of the next active
object label, its bounding box location, the corresponding future action, and
the time to contact the object. The ablation study shows the positive
contribution of the guided attention mechanism compared to other fusion
methods. Moreover, it is possible to improve the next active object location
and class label prediction results of GANO by just appending the learnable
object tokens with the region of interest embeddings.
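The guided attention idea in the abstract can be illustrated with a minimal cross-attention sketch. This is not the authors' GANO implementation: it assumes that video (spatiotemporal) tokens act as queries and object tokens act as keys and values, omits all learned projections, and uses hypothetical names (`guided_attention`, `video_tokens`, `object_tokens`) and toy 4-dimensional embeddings chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(video_tokens, object_tokens):
    """Scaled dot-product cross-attention without learned projections.

    Assumption: each video token is a query that attends over the
    object tokens (keys and values). The attended object information
    is added back to the video token as a residual.
    Returns (fused_tokens, attention_weights).
    """
    dim = len(object_tokens[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for q in video_tokens:
        # Similarity of this video token to every object token.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in object_tokens]
        w = softmax(scores)
        # Weighted sum of object tokens, one value per embedding dimension.
        attended = [
            sum(w[j] * object_tokens[j][d] for j in range(len(object_tokens)))
            for d in range(dim)
        ]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
        weights.append(w)
    return fused, weights


# Toy inputs: three video tokens and two object tokens.
video_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
object_tokens = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
]

fused, weights = guided_attention(video_tokens, object_tokens)
```

In this toy setup, a video token aligned with one object puts more attention weight on that object, while a token equally similar to both objects splits its weight evenly. A real model would add learned query, key, and value projections, multiple heads, and decoder heads for the object label, bounding box, action, and time to contact.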
Related papers
- Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [59.87033229815062]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered.
Previous research employed interactive perception for manipulating articulated objects, but open-loop approaches often overlook the interaction dynamics.
We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
arXiv Detail & Related papers (2024-09-24T17:59:56Z)
- Short-term Object Interaction Anticipation with Disentangled Object Detection @ Ego4D Short Term Object Interaction Anticipation Challenge [11.429137967096935]
Short-term object interaction anticipation is an important task in egocentric video analysis.
Our proposed method, SOIA-DOD, effectively decomposes it into 1) detecting active objects and 2) classifying interactions and predicting their timing.
Our method first detects all potential active objects in the last frame of egocentric video by fine-tuning a pre-trained YOLOv9.
arXiv Detail & Related papers (2024-07-08T08:13:16Z)
- Object Aware Egocentric Online Action Detection [23.504280692701272]
We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
Our work can be seamlessly integrated into existing models with minimal overhead and brings consistent performance enhancements.
arXiv Detail & Related papers (2024-06-03T07:58:40Z)
- Object-centric Video Representation for Long-term Action Anticipation [33.115854386196126]
The key motivation is that objects provide important cues to recognize and predict human-object interactions.
We propose to build object-centric video representations by leveraging visual-language pretrained models.
To recognize and predict human-object interactions, we use a Transformer-based neural architecture.
arXiv Detail & Related papers (2023-10-31T22:54:31Z)
- Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of Short-Term Object interaction anticipation (STA).
We propose NAOGAT, a multi-modal end-to-end transformer network, to guide the model to predict context-aware future actions.
Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z)
- Event-Free Moving Object Segmentation from Moving Ego Vehicle [88.33470650615162]
Moving object segmentation (MOS) in dynamic scenes is an important, challenging, but under-explored research topic for autonomous driving.
Most segmentation methods leverage motion cues obtained from optical flow maps.
We propose to exploit event cameras for better video understanding, which provide rich motion cues without relying on optical flow.
arXiv Detail & Related papers (2023-04-28T23:43:10Z)
- Anticipating Next Active Objects for Egocentric Videos [29.473527958651317]
This paper addresses the problem of anticipating the next-active-object location in the future, for a given egocentric video clip.
We propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip.
arXiv Detail & Related papers (2023-02-13T13:44:52Z)
- Recent Advances in Embedding Methods for Multi-Object Tracking: A Survey [71.10448142010422]
Multi-object tracking (MOT) aims to associate target objects across video frames in order to obtain entire moving trajectories.
Embedding methods play an essential role in object location estimation and temporal identity association in MOT.
We first conduct a comprehensive overview with in-depth analysis for embedding methods in MOT from seven different perspectives.
arXiv Detail & Related papers (2022-05-22T06:54:33Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame of the current timestamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Learning Long-term Visual Dynamics with Region Proposal Interaction Networks [75.06423516419862]
We build object representations that can capture inter-object and object-environment interactions over a long range.
Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin.
arXiv Detail & Related papers (2020-08-05T17:48:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.