Related papers: ZARRIO @ Ego4D Short Term Object Interaction Anticipation Challenge: Leveraging Affordances and Attention-based models for STA

ZARRIO @ Ego4D Short Term Object Interaction Anticipation Challenge: Leveraging Affordances and Attention-based models for STA

URL: http://arxiv.org/abs/2407.04369v1
Date: Fri, 5 Jul 2024 09:16:30 GMT
Title: ZARRIO @ Ego4D Short Term Object Interaction Anticipation Challenge: Leveraging Affordances and Attention-based models for STA
Authors: Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Josechu Guerrero-Campo, Giovanni Maria Farinella,
Abstract summary: Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to support STA predictions from an image-input video pair.
Score: 10.144283429670807
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to support STA predictions from an image-input video pair. Moreover, we introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. On the test set, our results obtain a final 33.5 N mAP, 17.25 N+V mAP, 11.77 N+{\delta} mAP and 6.75 Overall top-5 mAP metric when trained on the v2 training dataset.

Related papers

VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing of video sequences.<n>A novel temporal mask fusion employs SAM2 for bidirectional point propagation.<n>To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
FIction: 4D Future Interaction Prediction from Video [63.37136159797888]
We introduce FIction for 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predict which objects at what 3D locations the person will interact with in the next time period.
arXiv Detail & Related papers (2024-12-01T18:44:17Z)
Short-term Object Interaction Anticipation with Disentangled Object Detection @ Ego4D Short Term Object Interaction Anticipation Challenge [11.429137967096935]
Short-term object interaction anticipation is an important task in egocentric video analysis. Our proposed method, SOIA-DOD, effectively decomposes it into 1) detecting active object and 2) classifying interaction and predicting their timing. Our method first detects all potential active objects in the last frame of egocentric video by fine-tuning a pre-trained YOLOv9.
arXiv Detail & Related papers (2024-07-08T08:13:16Z)
AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation [14.734158936250918]
Short-Term object-interaction Anticipation is fundamental for wearable assistants or human robot interaction to understand user goals. We improve the performance of STA predictions with two contributions. First, we propose STAformer, a novel attention-based architecture integrating frame guided temporal pooling, dual image-video attention, and multiscale feature fusion. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot.
arXiv Detail & Related papers (2024-06-03T10:57:18Z)
Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos [91.44553585470688]
Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. We propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively.
arXiv Detail & Related papers (2023-08-20T18:23:07Z)
Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of Short-Term Object interaction anticipation (STA) We propose NAOGAT, a multi-modal end-to-end transformer network, to guide the model to predict context-aware future actions. Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z)
The MI-Motion Dataset and Benchmark for 3D Multi-Person Motion Prediction [13.177817435234449]
3D multi-person motion prediction is a challenging task that involves modeling individual behaviors and interactions between people. We introduce the Multi-Person Interaction Motion (MI-Motion) dataset, which includes skeleton sequences of multiple individuals collected by motion capture systems. The dataset contains 167k frames of interacting people's skeleton poses and is categorized into 5 different activity scenes.
arXiv Detail & Related papers (2023-06-23T15:38:22Z)
Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding [93.82542533426766]
We propose a Suspected Object Transformation mechanism (SOT) to encourage the target object selection among the suspected ones. SOT can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders. Extensive experiments demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2022-03-10T06:41:07Z)
Large Scale Interactive Motion Forecasting for Autonomous Driving : The Waymo Open Motion Dataset [84.3946567650148]
With over 100,000 scenes, each 20 seconds long at 10 Hz, our new dataset contains more than 570 hours of unique data over 1750 km of roadways. We use a high-accuracy 3D auto-labeling system to generate high quality 3D bounding boxes for each road agent. We introduce a new set of metrics that provides a comprehensive evaluation of both single agent and joint agent interaction motion forecasting models.
arXiv Detail & Related papers (2021-04-20T17:19:05Z)
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks. To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame. Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs. We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet. Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
Spatiotemporal Relationship Reasoning for Pedestrian Intent Prediction [57.56466850377598]
Reasoning over visual data is a desirable capability for robotics and vision-based applications. In this paper, we present a framework on graph to uncover relationships in different objects in the scene for reasoning about pedestrian intent. Pedestrian intent, defined as the future action of crossing or not-crossing the street, is a very crucial piece of information for autonomous vehicles.
arXiv Detail & Related papers (2020-02-20T18:50:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.