SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning
- URL: http://arxiv.org/abs/2508.14120v1
- Date: Mon, 18 Aug 2025 15:20:46 GMT
- Title: SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning
- Authors: Yuhang Lin, Yijia Xie, Jiahong Xie, Yuehao Huang, Ruoyu Wang, Jiajun Lv, Yukai Ma, Xingxing Zuo,
- Abstract summary: SimGenHOI is a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible HOI. Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose. To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding.
- Score: 6.255814224573073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating physically realistic humanoid-object interactions (HOI) is a fundamental challenge in robotics. Existing HOI generation approaches, such as diffusion-based models, often suffer from artifacts such as implausible contacts, penetrations, and unrealistic whole-body actions, which hinder successful execution in physical environments. To address these challenges, we introduce SimGenHOI, a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible HOI. Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose. These key actions capture essential interaction dynamics and are interpolated into smooth motion trajectories, naturally supporting long-horizon generation. To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding. Furthermore, we introduce a mutual fine-tuning strategy, where the generative model and the control policy iteratively refine each other, improving both motion realism and tracking robustness. Extensive experiments demonstrate that SimGenHOI generates realistic, diverse, and physically plausible humanoid-object interactions, achieving significantly higher tracking success rates in simulation and enabling long-horizon manipulation tasks. Code will be released upon acceptance on our project page: https://xingxingzuo.github.io/simgen_hoi.
Related papers
- D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping [66.22412592525369]
We introduce a real-to-sim-to-real engine that leverages the Gaussian Splat representations to build a differentiable engine. We show that our engine achieves accurate and robust performance in mass identification across various object geometries and mass values. Those optimized mass values facilitate force-aware policy learning, achieving superior and high performance in object grasping.
arXiv Detail & Related papers (2026-03-01T15:32:04Z)
- MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction [54.36564144414704]
MeshMimic is an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects.
arXiv Detail & Related papers (2026-02-17T17:09:45Z)
- Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models [57.71440995598757]
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models. Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world.
arXiv Detail & Related papers (2025-12-15T18:03:42Z)
- PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System [67.2851799763138]
PhysHSI comprises a simulation training pipeline and a real-world deployment system. In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data. For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs.
arXiv Detail & Related papers (2025-10-13T07:11:37Z)
- OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction [76.44108003274955]
A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning policies. We introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh. By minimizing the Laplacian deformation between the human and robot meshes, OmniRetarget generates kinematically feasible trajectories.
arXiv Detail & Related papers (2025-09-30T17:59:02Z)
- Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis [51.95817740348585]
Human-X is a novel framework designed to enable immersive and physically plausible human interactions across diverse entities. Our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner. Our framework is validated in real-world applications, including virtual reality interface for human-robot interaction.
arXiv Detail & Related papers (2025-08-04T06:35:48Z)
- PhysiInter: Integrating Physical Mapping for High-Fidelity Human Interaction Generation [35.563978243352764]
We introduce physical mapping, integrated throughout the human interaction generation pipeline. Specifically, motion imitation within a physics-based simulation environment is used to project target motions into a physically valid space. Experiments show our method achieves impressive results in generated human motion quality, with a 3%-89% improvement in physical fidelity.
arXiv Detail & Related papers (2025-06-09T06:04:49Z)
- SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories [124.24041272390954]
Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems. We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image. We propose SIGHT-Fusion, a novel diffusion-based image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database.
arXiv Detail & Related papers (2025-03-28T20:53:20Z)
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
- OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains [66.62502882481373]
Current methods tend to focus either on the body or the hands, which limits their ability to produce cohesive and realistic interactions. We propose OOD-HOI, a text-driven framework for generating whole-body human-object interactions that generalize well to new objects and actions. Our approach integrates a dual-branch reciprocal diffusion model to synthesize initial interaction poses, a contact-guided interaction refiner to improve physical accuracy based on predicted contact areas, and a dynamic adaptation mechanism which includes semantic adjustment and geometry deformation to improve robustness.
arXiv Detail & Related papers (2024-11-27T10:13:35Z)
- ReinDiffuse: Crafting Physically Plausible Motions with Reinforced Diffusion Model [9.525806425270428]
We present ReinDiffuse, which combines reinforcement learning with a motion diffusion model to generate physically credible human motions.
Our method adapts the Motion Diffusion Model to output a parameterized distribution of actions, making them compatible with reinforcement learning paradigms.
Our approach outperforms existing state-of-the-art models on two major datasets, HumanML3D and KIT-ML.
arXiv Detail & Related papers (2024-10-09T16:24:11Z)
- Haptic Repurposing with GenAI [5.424247121310253]
Mixed Reality aims to merge the digital and physical worlds to create immersive human-computer interactions.
This paper introduces Haptic Repurposing with GenAI, an innovative approach to enhance MR interactions by transforming any physical objects into adaptive haptic interfaces for AI-generated virtual assets.
arXiv Detail & Related papers (2024-06-11T13:06:28Z)
- I-CTRL: Imitation to Control Humanoid Robots Through Constrained Reinforcement Learning [8.97654258232601]
We develop a framework to control humanoid robots through bounded residual reinforcement learning (I-CTRL). I-CTRL excels in motion imitation with simple and unique rewards that generalize across five robots. Our framework introduces an automatic priority scheduler to manage large-scale motion datasets.
arXiv Detail & Related papers (2024-05-14T16:12:27Z)