InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions
- URL: http://arxiv.org/abs/2502.20390v1
- Date: Thu, 27 Feb 2025 18:59:12 GMT
- Title: InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions
- Authors: Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, Liang-Yan Gui
- Abstract summary: We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets.
- Score: 27.225777494300775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.
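As an unofficial illustration of the distillation stage described in the abstract, the sketch below shows a DAgger-style update in PyTorch: a frozen teacher labels the states the student itself visits, and the student regresses onto those actions. Every name, dimension, and hyperparameter here is a placeholder of ours, not the authors' implementation; InterMimic's subsequent RL fine-tuning would add a reward-driven objective on top of this supervision.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Minimal MLP policy; dimensions are illustrative only."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def distill_step(student: Policy, teacher: Policy,
                 obs: torch.Tensor, opt: torch.optim.Optimizer) -> float:
    """One DAgger-style update: the frozen teacher labels states the
    student visited, and the student regresses onto those actions."""
    with torch.no_grad():
        target = teacher(obs)  # teacher acting as an online expert
    loss = nn.functional.mse_loss(student(obs), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random "observations" standing in for simulator states.
teacher = Policy(obs_dim=64, act_dim=32)
student = Policy(obs_dim=64, act_dim=32)
opt = torch.optim.Adam(student.parameters(), lr=3e-4)
print(distill_step(student, teacher, torch.randn(128, 64), opt))
```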
Related papers
- D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping [66.22412592525369]
We introduce a real-to-sim-to-real engine that leverages Gaussian Splat representations to build a differentiable engine.
We show that our engine achieves accurate and robust performance in mass identification across various object geometries and mass values.
The optimized mass values facilitate force-aware policy learning, achieving superior performance in object grasping.
arXiv Detail & Related papers (2026-03-01T15:32:04Z) - InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions [58.329946838699044]
Humans rarely plan interactions with objects at the level of explicit whole-body movements.
Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills.
We introduce InterPrior, a framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning.
arXiv Detail & Related papers (2026-02-05T18:59:27Z) - Learning Whole-Body Human-Humanoid Interaction from Human-Human Demonstrations [63.80827184637476]
We introduce D-STAR, a hierarchical policy that disentangles when to act from where to act.
We validate our framework through extensive and rigorous simulations.
arXiv Detail & Related papers (2026-01-14T14:37:06Z) - Learning Interactive World Model for Object-Centric Reinforcement Learning [27.710001478315288]
We introduce FIOC-WM, a unified framework that learns structured representations of both objects and their interactions within a world model.
FIOC-WM captures environment dynamics with disentangled and modular representations of object interactions.
On simulated robotic and embodied-AI benchmarks, FIOC-WM improves policy-learning sample efficiency and generalization over world-model baselines.
arXiv Detail & Related papers (2025-11-04T03:35:58Z) - PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System [67.2851799763138]
PhysHSI comprises a simulation training pipeline and a real-world deployment system.
In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data.
For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs.
arXiv Detail & Related papers (2025-10-13T07:11:37Z) - DexNDM: Closing the Reality Gap for Dexterous In-Hand Rotation via Joint-Wise Neural Dynamics Model [22.46947045094797]
We develop a novel framework that enables a single policy, trained in simulation, to generalize to a wide variety of objects and conditions in the real world.
We show that a single policy successfully rotates challenging objects with complex shapes (e.g., animals), high aspect ratios (up to 5.33), and small sizes.
arXiv Detail & Related papers (2025-10-09T17:59:11Z) - OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction [76.44108003274955]
A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning policies.
We introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh.
By minimizing the Laplacian deformation between the human and robot meshes, OmniRetarget generates kinematically feasible trajectories.
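For context, the Laplacian-deformation objective mentioned in the summary above is commonly written as follows; this is our notation for the generic interaction-mesh energy, not necessarily the paper's exact formulation. Here v_i are vertices of the mesh, N(i) the neighbors of vertex i, and w_ij fixed weights:

```latex
% Laplacian (differential) coordinate of vertex i:
\[
\delta_i(V) = v_i - \sum_{j \in \mathcal{N}(i)} w_{ij}\, v_j
\]
% Deformation energy between the source mesh V and the retargeted mesh V':
\[
E(V') = \sum_i \bigl\lVert \delta_i(V') - \delta_i(V) \bigr\rVert^2
\]
```

Minimizing E preserves the relative spatial arrangement the mesh encodes while the absolute geometry adapts to the robot.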
arXiv Detail & Related papers (2025-09-30T17:59:02Z) - SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning [6.255814224573073]
SimGenHOI is a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible HOI.
Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose.
To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding.
arXiv Detail & Related papers (2025-08-18T15:20:46Z) - Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation.
Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction.
Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z) - Human-Object Interaction with Vision-Language Model Guided Relative Movement Dynamics [30.43930233035367]
This paper introduces a unified Human-Object Interaction framework.
It provides control over interactions with static scenes and dynamic objects using language commands.
Our framework supports long-horizon interactions among dynamic, articulated, and static objects.
arXiv Detail & Related papers (2025-03-24T05:18:04Z) - ObjectMover: Generative Object Movement with Video Prior [69.75281888309017]
We present ObjectMover, a generative model that can perform object movement in challenging scenes.
We propose a multi-task learning strategy that enables training on real-world video data to improve the model's generalization.
We show that with this approach, our model is able to adjust to complex real-world scenarios.
arXiv Detail & Related papers (2025-03-11T04:42:59Z) - InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z) - Integrating Physics and Topology in Neural Networks for Learning Rigid Body Dynamics [6.675805308519987]
We introduce a novel framework for modeling rigid body dynamics and learning collision interactions.
We propose a physics-informed message-passing neural architecture, embedding physical laws directly in the model.
This work addresses the challenge of multi-entity dynamic interactions, with applications spanning diverse scientific and engineering domains.
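The summary above does not detail the architecture, so as a generic sketch, here is a minimal message-passing step of the kind such physics-informed networks build on; all names, shapes, and the choice of sum aggregation are our assumptions:

```python
import torch
import torch.nn as nn

def message_passing_step(node_feats: torch.Tensor, edges: torch.Tensor,
                         msg_mlp: nn.Module, upd_mlp: nn.Module) -> torch.Tensor:
    """One generic message-passing update.
    node_feats: (N, F) per-body features (e.g., pose, velocity, mass).
    edges: (E, 2) long tensor of (src, dst) pairs, e.g. contact candidates.
    """
    src, dst = edges[:, 0], edges[:, 1]
    # Messages computed from each interacting pair of bodies.
    messages = msg_mlp(torch.cat([node_feats[src], node_feats[dst]], dim=-1))
    # Sum incoming messages per receiving body.
    agg = torch.zeros(node_feats.shape[0], messages.shape[-1])
    agg.index_add_(0, dst, messages)
    # Update each body's state from its own features plus aggregated messages.
    return upd_mlp(torch.cat([node_feats, agg], dim=-1))

# Toy usage: 4 bodies, 3 directed interaction edges.
msg_mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
upd_mlp = nn.Sequential(nn.Linear(8 + 8, 8))
x = torch.randn(4, 8)
e = torch.tensor([[0, 1], [1, 2], [3, 2]])
print(message_passing_step(x, e, msg_mlp, upd_mlp).shape)  # torch.Size([4, 8])
```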
arXiv Detail & Related papers (2024-11-18T11:03:15Z) - CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation [44.9991846328409]
Crowd Motion Generation is essential in entertainment industries such as animation and games, as well as in strategic fields like urban simulation and planning.
We introduce CrowdMoGen, a zero-shot text-driven framework that harnesses Large Language Models (LLMs) to incorporate collective intelligence into motion generation.
Our framework consists of two key components: 1) Crowd Scene Planner that learns to coordinate motions and dynamics according to specific scene contexts or introduced perturbations, and 2) Collective Motion Generator that efficiently synthesizes the required collective motions.
arXiv Detail & Related papers (2024-07-08T17:59:36Z) - Human-Object Interaction from Human-Level Instructions [17.10279738828331]
We propose the first complete system for synthesizing human-object interactions for object manipulation in contextual environments.
We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans.
Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements.
arXiv Detail & Related papers (2024-06-25T17:46:28Z) - CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics [44.30880626337739]
CooHOI is a framework designed to tackle the multi-humanoid object transportation problem.
A single humanoid character learns to interact with objects through imitation learning from human motion priors.
Then, the humanoid learns to collaborate with others by considering the shared dynamics of the manipulated object.
arXiv Detail & Related papers (2024-06-20T17:59:22Z) - Grasp Anything: Combining Teacher-Augmented Policy Gradient Learning with Instance Segmentation to Grasp Arbitrary Objects [18.342569823885864]
Teacher-Augmented Policy Gradient (TAPG) is a novel two-stage learning framework that synergizes reinforcement learning and policy distillation.
TAPG facilitates guided, yet adaptive, learning of a sensorimotor policy, based on object segmentation.
Our trained policies adeptly grasp a wide variety of objects from cluttered scenarios in simulation and the real world based on human-understandable prompts.
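One plausible reading of "synergizing reinforcement learning and policy distillation" is a combined objective: a standard policy-gradient term plus a behavior-cloning term toward the teacher. The sketch below is our guess at that general shape, not TAPG's actual loss:

```python
import torch
import torch.nn as nn

def tapg_style_loss(log_probs: torch.Tensor, advantages: torch.Tensor,
                    student_actions: torch.Tensor, teacher_actions: torch.Tensor,
                    bc_weight: float = 0.5) -> torch.Tensor:
    """Policy-gradient term plus a distillation (behavior-cloning) term.
    log_probs: (B,) log pi(a|s) of sampled actions under the student.
    advantages: (B,) advantage estimates (assumed precomputed).
    student_actions / teacher_actions: (B, A) mean actions on the same states.
    """
    pg_loss = -(log_probs * advantages.detach()).mean()
    bc_loss = nn.functional.mse_loss(student_actions, teacher_actions.detach())
    return pg_loss + bc_weight * bc_loss

# Toy usage with random stand-ins for rollout statistics.
print(tapg_style_loss(torch.randn(32), torch.randn(32),
                      torch.randn(32, 7), torch.randn(32, 7)))
```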
arXiv Detail & Related papers (2024-03-15T10:48:16Z) - Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes.
Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene.
Our approach seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z) - Physically Plausible Full-Body Hand-Object Interaction Synthesis [32.83908152822006]
We propose a physics-based method for synthesizing dexterous hand-object interactions in a full-body setting.
Existing methods often focus on isolated segments of the interaction process and rely on data-driven techniques that may result in artifacts.
arXiv Detail & Related papers (2023-09-14T17:55:18Z) - Tachikuma: Understanding Complex Interactions with Multi-Character and Novel Objects by Large Language Models [67.20964015591262]
We introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation task and a supporting dataset.
The dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further explorations.
We present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding.
arXiv Detail & Related papers (2023-07-24T07:40:59Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs [79.72586714047199]
We propose an attention-based deep learning method to reconstruct full-body motion from six IMU sensors in real time.
Our method achieves new state-of-the-art results both quantitatively and qualitatively, while being simple to implement and smaller in size.
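As a rough, unofficial illustration of attention-based pose estimation from six IMUs, a minimal encoder might look like the sketch below; the feature dimensions, layer counts, and the absence of causal masking are all simplifications of ours, not the paper's design:

```python
import torch
import torch.nn as nn

class IMUPoser(nn.Module):
    """Toy transformer mapping a window of 6-IMU readings to body pose.
    Assumes each IMU contributes orientation + acceleration features."""
    def __init__(self, imu_dim: int = 6 * 12, pose_dim: int = 72,
                 d_model: int = 128):
        super().__init__()
        self.embed = nn.Linear(imu_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, pose_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, imu_dim) -> pose for the most recent frame.
        h = self.encoder(self.embed(x))
        return self.head(h[:, -1])

model = IMUPoser()
print(model(torch.randn(2, 30, 72)).shape)  # torch.Size([2, 72])
```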
arXiv Detail & Related papers (2022-03-29T16:24:52Z) - iGibson, a Simulation Environment for Interactive Tasks in Large
Realistic Scenes [54.04456391489063]
iGibson is a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes.
Our environment contains fifteen fully interactive home-sized scenes populated with rigid and articulated objects.
We show that iGibson features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of simple human-demonstrated behaviors.
arXiv Detail & Related papers (2020-12-05T02:14:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.