DAViD: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models
- URL: http://arxiv.org/abs/2501.08333v1
- Date: Tue, 14 Jan 2025 18:59:59 GMT
- Title: DAViD: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models
- Authors: Hyeonwoo Kim, Sangwon Beak, Hanbyul Joo,
- Abstract summary: We present a method to learn the 3D dynamic affordance from synthetically generated 2D videos.<n>Specifically, we propose a pipeline that first generates 2D HOI videos from the 3D object and then lifts them into 3D to generate 4D HOI samples.
- Score: 9.103840202072336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the ability of humans to use objects is crucial for AI to improve daily life. Existing studies for learning such ability focus on human-object patterns (e.g., contact, spatial relation, orientation) in static situations, and learning Human-Object Interaction (HOI) patterns over time (i.e., movement of human and object) is relatively less explored. In this paper, we introduce a novel type of affordance named Dynamic Affordance. For a given input 3D object mesh, we learn dynamic affordance which models the distribution of both (1) human motion and (2) human-guided object pose during interactions. As a core idea, we present a method to learn the 3D dynamic affordance from synthetically generated 2D videos, leveraging a pre-trained video diffusion model. Specifically, we propose a pipeline that first generates 2D HOI videos from the 3D object and then lifts them into 3D to generate 4D HOI samples. Once we generate diverse 4D HOI samples on various target objects, we train our DAViD, where we present a method based on the Low-Rank Adaptation (LoRA) module for pre-trained human motion diffusion model (MDM) and an object pose diffusion model with human pose guidance. Our motion diffusion model is extended for multi-object interactions, demonstrating the advantage of our pipeline with LoRA for combining the concepts of object usage. Through extensive experiments, we demonstrate our DAViD outperforms the baselines in generating human motion with HOIs.
Related papers
- GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects [13.830968058014546]
GenHOI is a two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences.<n>We introduce a Contact-Aware Diffusion Model (ContactDM) in the second stage to seamlessly interpolate 3D HOIs into densely temporally coherent 4D HOI sequences.<n> Experimental results show that we achieve state-of-the-art results on the publicly available OMOMODM and 3D-FUTURE datasets.
arXiv Detail & Related papers (2025-06-18T14:17:53Z) - Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors [31.277540988829976]
This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets.
We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images.
arXiv Detail & Related papers (2025-03-25T23:55:47Z) - AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation [60.5897687447003]
AvatarGO is a novel framework designed to generate realistic 4D HOI scenes from textual inputs.
Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling issues.
As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.
arXiv Detail & Related papers (2024-10-09T17:58:56Z) - EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z) - Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models [116.31344506738816]
We present a novel framework, textbfDiffusion4D, for efficient and scalable 4D content generation.
We develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets.
Our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency.
arXiv Detail & Related papers (2024-05-26T17:47:34Z) - DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and
Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z) - Realistic Human Motion Generation with Cross-Diffusion Models [30.854425772128568]
Cross Human Motion Diffusion Model (CrossDiff)
Method integrates 3D and 2D information using a shared transformer network within the training of the diffusion model.
CrossDiff effectively combines the strengths of both representations to generate more realistic motion sequences.
arXiv Detail & Related papers (2023-12-18T07:44:40Z) - H4D: Human 4D Modeling by Learning Neural Compositional Representation [75.34798886466311]
This work presents a novel framework that can effectively learn a compact and compositional representation for dynamic human.
A simple yet effective linear motion model is proposed to provide a rough and regularized motion estimation.
Experiments demonstrate our method is not only efficacy in recovering dynamic human with accurate motion and detailed geometry, but also amenable to various 4D human related tasks.
arXiv Detail & Related papers (2022-03-02T17:10:49Z) - Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on an in-the-wild human video dataset 3DPW.
arXiv Detail & Related papers (2021-11-29T16:32:41Z) - Estimating 3D Motion and Forces of Human-Object Interactions from
Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z) - Hindsight for Foresight: Unsupervised Structured Dynamics Models from
Physical Interaction [24.72947291987545]
Key challenge for an agent learning to interact with the world is to reason about physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.