InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos
- URL: http://arxiv.org/abs/2509.00767v1
- Date: Sun, 31 Aug 2025 09:38:59 GMT
- Title: InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos
- Authors: Yangsong Zhang, Abdul Ahad Butt, Gül Varol, Ivan Laptev,
- Abstract summary: We propose an automatic motion extraction pipeline and use it to collect interaction-rich human motions.<n>Our new dataset InterPose contains 73.8K sequences of 3D human motions and corresponding text captions automatically obtained from 45.8K videos with human-object interactions.
- Score: 35.49253697710172
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human motion generation has shown great advances thanks to the recent diffusion models trained on large-scale motion capture data. Most of existing works, however, currently target animation of isolated people in empty scenes. Meanwhile, synthesizing realistic human-object interactions in complex 3D scenes remains a critical challenge in computer graphics and robotics. One obstacle towards generating versatile high-fidelity human-object interactions is the lack of large-scale datasets with diverse object manipulations. Indeed, existing motion capture data is typically restricted to single people and manipulations of limited sets of objects. To address this issue, we propose an automatic motion extraction pipeline and use it to collect interaction-rich human motions. Our new dataset InterPose contains 73.8K sequences of 3D human motions and corresponding text captions automatically obtained from 45.8K videos with human-object interactions. We perform extensive experiments and demonstrate InterPose to bring significant improvements to state-of-the-art methods for human motion generation. Moreover, using InterPose we develop an LLM-based agent enabling zero-shot animation of people interacting with diverse objects and scenes.
Related papers
- HUMOTO: A 4D Dataset of Mocap Human Object Interactions [27.573065832588554]
Human Motions with Objects is a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications.<n>Humoto captures interactions with 63 precisely modeled objects and 72 articulated parts.<n>Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations.
arXiv Detail & Related papers (2025-04-14T16:59:29Z) - AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation [60.5897687447003]
AvatarGO is a novel framework designed to generate realistic 4D HOI scenes from textual inputs.
Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling issues.
As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.
arXiv Detail & Related papers (2024-10-09T17:58:56Z) - EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z) - ParaHome: Parameterizing Everyday Home Activities Towards 3D Generative Modeling of Human-Object Interactions [10.364340631868322]
We introduce our ParaHome system designed to capture dynamic 3D movements of humans and objects within a common home environment.<n>Our system features a multi-view setup with 70 synchronized RGB cameras, along with wearable motion capture devices including an IMU-based body suit and hand motion capture gloves.<n>By leveraging the ParaHome system, we collect a new human-object interaction dataset, including 486 minutes of sequences across 207 captures with 38 participants.
arXiv Detail & Related papers (2024-01-18T18:59:58Z) - Revisit Human-Scene Interaction via Space Occupancy [55.67657438543008]
Human-scene Interaction (HSI) generation is a challenging task and crucial for various downstream tasks.
In this work, we argue that interaction with a scene is essentially interacting with the space occupancy of the scene from an abstract physical perspective.
By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database.
arXiv Detail & Related papers (2023-12-05T12:03:00Z) - Object Motion Guided Human Motion Synthesis [22.08240141115053]
We study the problem of full-body human motion synthesis for the manipulation of large-sized objects.
We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework.
We develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated.
arXiv Detail & Related papers (2023-09-28T08:22:00Z) - Synthesizing Diverse Human Motions in 3D Indoor Scenes [16.948649870341782]
We present a novel method for populating 3D indoor scenes with virtual humans that can navigate in the environment and interact with objects in a realistic manner.
Existing approaches rely on training sequences that contain captured human motions and the 3D scenes they interact with.
We propose a reinforcement learning-based approach that enables virtual humans to navigate in 3D scenes and interact with objects realistically and autonomously.
arXiv Detail & Related papers (2023-05-21T09:22:24Z) - Task-Oriented Human-Object Interactions Generation with Implicit Neural
Representations [61.659439423703155]
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations.
Our method generates continuous motions that are parameterized only by the temporal coordinate.
This work takes a step further toward general human-scene interaction simulation.
arXiv Detail & Related papers (2023-03-23T09:31:56Z) - HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes [54.61610144668777]
We present a novel scene-and-language conditioned generative model that can produce 3D human motions in 3D scenes.
Our experiments demonstrate that our model generates diverse and semantically consistent human motions in 3D scenes.
arXiv Detail & Related papers (2022-10-18T10:14:11Z) - Estimating 3D Motion and Forces of Human-Object Interactions from
Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z) - Learning Motion Priors for 4D Human Body Capture in 3D Scenes [81.54377747405812]
We propose LEMO: LEarning human MOtion priors for 4D human body capture.
We introduce a novel motion prior, which reduces the jitters exhibited by poses recovered over a sequence.
We also design a contact friction term and a contact-aware motion infiller obtained via per-instance self-supervised training.
With our pipeline, we demonstrate high-quality 4D human body capture, reconstructing smooth motions and physically plausible body-scene interactions.
arXiv Detail & Related papers (2021-08-23T20:47:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.