Related papers: Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos

URL: http://arxiv.org/abs/2405.04370v4
Date: Fri, 22 Nov 2024 05:09:29 GMT
Title: Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos
Authors: Junyi Ma, Jingyi Xu, Xieyuanli Chen, Hesheng Wang
Abstract summary: We propose Diff-IP2D to forecast future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner. Our method significantly outperforms the state-of-the-art baselines on both the off-the-shelf metrics and our newly proposed evaluation protocol.
Score: 22.81433371521832
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works have been proposed to simultaneously forecast hand trajectories and object affordances on human egocentric videos. The joint prediction serves as a comprehensive representation of future hand-object interactions in 2D space, indicating potential human motion and motivation. However, the existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence, and accumulates errors along the time axis. Meanwhile, these works basically overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, namely Diff-IP2D, to forecast future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner. We transform the sequential 2D images into latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to make Diff-IP2D aware of the camera wearer's dynamics for more accurate interaction prediction. Extensive experiments demonstrate that our method significantly outperforms the state-of-the-art baselines on both the off-the-shelf metrics and our newly proposed evaluation protocol. This highlights the efficacy of leveraging a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D is released as open source at https://github.com/IRMVLab/Diff-IP2D.

Related papers

Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning [50.76723760768117]
Existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. We find that human appearance can provide a straightforward cue to address these obstacles. We propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws.
arXiv Detail & Related papers (2025-07-03T12:19:26Z)
FIction: 4D Future Interaction Prediction from Video [63.37136159797888]
We introduce FIction for 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predict which objects at what 3D locations the person will interact with in the next time period.
arXiv Detail & Related papers (2024-12-01T18:44:17Z)
MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos [27.766405152248055]
Hand trajectory prediction plays a vital role in comprehending human motion patterns. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. We propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models.
arXiv Detail & Related papers (2024-09-04T12:06:33Z)
Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration. In this work, we tackle the task of reconstructing closely interactive humans from a monocular video. We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior. Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion [29.25063155767897]
This paper addresses a novel task of anticipating 3D human-object interactions (HOIs). Our task is significantly more challenging, as it requires modeling dynamic objects with various shapes, capturing whole-body motion, and ensuring physically valid interactions. Experiments on multiple human-object interaction datasets demonstrate the effectiveness of our method for this task, capable of producing realistic, vivid, and remarkably long-term 3D HOI predictions.
arXiv Detail & Related papers (2023-08-31T17:59:08Z)
TransFusion: A Practical and Effective Transformer-based Diffusion Model for 3D Human Motion Prediction [1.8923948104852863]
We propose TransFusion, an innovative and practical diffusion-based model for 3D human motion prediction. Our model leverages Transformer as the backbone with long skip connections between shallow and deep layers. In contrast to prior diffusion-based models that utilize extra modules like cross-attention and adaptive layer normalization, we treat all inputs, including conditions, as tokens to create a more lightweight model.
arXiv Detail & Related papers (2023-07-30T01:52:07Z)
Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection [30.896749712316222]
This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as a temporal graph with human and object nodes as input. We achieve state-of-the-art performance on the CAD-120 and Something-Else datasets.
arXiv Detail & Related papers (2022-06-07T07:26:06Z)
Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction [63.62263239934777]
We conduct an in-depth study on various pose representations with a focus on their effects on the motion prediction task. We propose a novel RNN architecture termed AHMR (Attentive Hierarchical Motion Recurrent network) for motion prediction. Our approach outperforms the state-of-the-art methods in short-term prediction and achieves much enhanced long-term prediction proficiency.
arXiv Detail & Related papers (2021-12-30T10:45:22Z)
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks. To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame. Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
SGCN: Sparse Graph Convolution Network for Pedestrian Trajectory Prediction [64.16212996247943]
We present a Sparse Graph Convolution Network (SGCN) for pedestrian trajectory prediction. Specifically, the SGCN explicitly models the sparse directed interaction with a sparse directed spatial graph to capture adaptive interaction pedestrians. Visualizations indicate that our method can capture adaptive interactions between pedestrians and their effective motion tendencies.
arXiv Detail & Related papers (2021-04-04T03:17:42Z)
End-to-end Contextual Perception and Prediction with Interaction Transformer [79.14001602890417]
We tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving. To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture. Our model can be trained end-to-end, and runs in real-time.
arXiv Detail & Related papers (2020-08-13T14:30:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.