InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions
- URL: http://arxiv.org/abs/2502.20390v1
- Date: Thu, 27 Feb 2025 18:59:12 GMT
- Title: InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions
- Authors: Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, Liang-Yan Gui
- Abstract summary: We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets.
- Score: 27.225777494300775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.
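The "perfect first, then scale up" curriculum in the abstract (subject-specific teachers, online distillation into a single student, then RL fine-tuning) can be pictured with a minimal sketch. This is not the authors' implementation: the MLP sizes, `OBS_DIM`/`ACT_DIM`, `NUM_SUBJECTS`, and the `distill_step` helper are illustrative assumptions, and only the distillation stage is shown in any detail.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 512, 69      # hypothetical observation / action sizes
NUM_SUBJECTS = 4                # one teacher per MoCap subject (illustrative)

def mlp(in_dim, out_dim):
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, out_dim),
    )

# Stage 1 (assumed done elsewhere): subject-specific teacher policies trained
# to mimic, retarget, and refine their own MoCap clips.
teachers = [mlp(OBS_DIM, ACT_DIM) for _ in range(NUM_SUBJECTS)]

# Stage 2: distill all teachers into a single student policy. The teachers act
# as online experts: the student collects states, and the matching teacher
# labels those states with its actions (DAgger-style supervision).
student = mlp(OBS_DIM, ACT_DIM)
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)

def distill_step(obs: torch.Tensor, subject_ids: list[int]) -> float:
    """One distillation update on a batch of student-visited states."""
    with torch.no_grad():
        expert_actions = torch.stack(
            [teachers[s](o) for s, o in zip(subject_ids, obs)]
        )
    loss = nn.functional.mse_loss(student(obs), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a batch of 8 states drawn from rollouts across the subjects.
obs = torch.randn(8, OBS_DIM)
ids = [i % NUM_SUBJECTS for i in range(8)]
print(distill_step(obs, ids))

# Stage 3 (not shown): RL fine-tuning of the student so it can surpass mere
# demonstration replication, as described in the abstract.
```

Supervising the student on states it visits itself, rather than only on the teachers' reference trajectories, is what makes the teachers "online experts" in the abstract's sense; the final RL fine-tuning stage then lets the student exceed the teachers instead of merely imitating them.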
Related papers
- Human-Object Interaction with Vision-Language Model Guided Relative Movement Dynamics [30.43930233035367]
This paper introduces a unified Human-Object Interaction framework.
It provides unified control over interactions with static scenes and dynamic objects using language commands.
Our framework supports long-horizon interactions among dynamic, articulated, and static objects.
arXiv Detail & Related papers (2025-03-24T05:18:04Z) - ObjectMover: Generative Object Movement with Video Prior [69.75281888309017]
We present ObjectMover, a generative model that can perform object movement in challenging scenes.
We show that with this approach, our model is able to adjust to complex real-world scenarios.
We propose a multi-task learning strategy that enables training on real-world video data to improve the model generalization.
arXiv Detail & Related papers (2025-03-11T04:42:59Z) - InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z) - Integrating Physics and Topology in Neural Networks for Learning Rigid Body Dynamics [6.675805308519987]
We introduce a novel framework for modeling rigid body dynamics and learning collision interactions.
We propose a physics-informed message-passing neural architecture, embedding physical laws directly in the model.
This work addresses the challenge of multi-entity dynamic interactions, with applications spanning diverse scientific and engineering domains.
arXiv Detail & Related papers (2024-11-18T11:03:15Z) - CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation [44.9991846328409]
Crowd Motion Generation is essential in entertainment industries such as animation and games, as well as in strategic fields like urban simulation and planning.
We introduce CrowdMoGen, a zero-shot text-driven framework that harnesses the power of a Large Language Model (LLM) to incorporate collective intelligence into the motion generation framework.
Our framework consists of two key components: 1) Crowd Scene Planner that learns to coordinate motions and dynamics according to specific scene contexts or introduced perturbations, and 2) Collective Motion Generator that efficiently synthesizes the required collective motions.
arXiv Detail & Related papers (2024-07-08T17:59:36Z) - Human-Object Interaction from Human-Level Instructions [17.10279738828331]
We propose the first complete system for synthesizing human-object interactions for object manipulation in contextual environments. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements.
arXiv Detail & Related papers (2024-06-25T17:46:28Z) - CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics [44.30880626337739]
CooHOI is a framework designed to tackle the multi-humanoid object transportation problem.
A single humanoid character learns to interact with objects through imitation learning from human motion priors.
Then, the humanoid learns to collaborate with others by considering the shared dynamics of the manipulated object.
arXiv Detail & Related papers (2024-06-20T17:59:22Z) - Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes.
Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene.
Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z) - Physically Plausible Full-Body Hand-Object Interaction Synthesis [32.83908152822006]
We propose a physics-based method for synthesizing dexterous hand-object interactions in a full-body setting.
Existing methods often focus on isolated segments of the interaction process and rely on data-driven techniques that may result in artifacts.
arXiv Detail & Related papers (2023-09-14T17:55:18Z) - Tachikuma: Understanding Complex Interactions with Multi-Character and Novel Objects by Large Language Models [67.20964015591262]
We introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation task and a supporting dataset.
The dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further explorations.
We present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding.
arXiv Detail & Related papers (2023-07-24T07:40:59Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages a language-reasoning segmentation mask generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs [79.72586714047199]
We propose an attention-based deep learning method to reconstruct full-body motion from six IMU sensors in real-time.
Our method achieves new state-of-the-art results both quantitatively and qualitatively, while being simple to implement and smaller in size.
arXiv Detail & Related papers (2022-03-29T16:24:52Z) - iGibson, a Simulation Environment for Interactive Tasks in Large Realistic Scenes [54.04456391489063]
iGibson is a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes.
Our environment contains fifteen fully interactive home-sized scenes populated with rigid and articulated objects.
iGibson's features enable the generalization of navigation agents, and its human-iGibson interface and integrated motion planners facilitate efficient imitation learning of simple human-demonstrated behaviors.
arXiv Detail & Related papers (2020-12-05T02:14:17Z)