FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation
- URL: http://arxiv.org/abs/2602.13444v1
- Date: Fri, 13 Feb 2026 20:46:08 GMT
- Title: FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation
- Authors: Huajian Zeng, Lingyun Chen, Jiaqi Yang, Yuantai Zhang, Fan Shi, Peidong Liu, Xingxing Zuo
- Abstract summary: FlowHOI is a flow-matching framework that generates temporally coherent HOI sequences. We show that FlowHOI achieves the highest action recognition accuracy and a 1.7$\times$ higher physics simulation success rate.
- Score: 23.19464039872024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7$\times$ higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40$\times$ inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.
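The abstract's recipe, conditional flow matching over HOI sequences combined with a motion-text alignment loss, can be made concrete with a short sketch. The PyTorch code below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a rectified-flow (linear interpolation) probability path, an MSE velocity target, and a symmetric InfoNCE-style alignment term; `velocity_net`, `motion_encoder`, `text_encoder`, the tensor shapes, and the 0.1 loss weight are all hypothetical.

```python
# Minimal sketch of conditional flow matching for HOI sequence
# generation with a motion-text alignment loss. Assumptions (not
# from the paper): rectified-flow linear path, MSE velocity target,
# symmetric InfoNCE alignment; all names/shapes are hypothetical.
import torch
import torch.nn.functional as F

def flowhoi_training_step(velocity_net, motion_encoder, text_encoder,
                          hoi_seq, scene_tokens, text_tokens):
    """One loss computation on a batch of HOI sequences.

    hoi_seq:      (B, T, D) hand poses, object poses, contact states
    scene_tokens: (B, S, C) compact 3D scene tokens (e.g. from 3DGS)
    text_tokens:  (B, L)    tokenized language instruction
    """
    B = hoi_seq.shape[0]
    x1 = hoi_seq                                  # data endpoint
    x0 = torch.randn_like(x1)                     # noise endpoint
    t = torch.rand(B, 1, 1, device=x1.device)     # time in [0, 1]

    # Linear path x_t = (1 - t) x0 + t x1 has constant target
    # velocity v* = x1 - x0, the quantity the network regresses.
    xt = (1.0 - t) * x0 + t * x1
    v_pred = velocity_net(xt, t, scene_tokens, text_tokens)
    loss_fm = F.mse_loss(v_pred, x1 - x0)

    # Motion-text alignment: pull paired motion/text embeddings
    # together, push mismatched pairs apart (CLIP-style objective).
    m = F.normalize(motion_encoder(hoi_seq), dim=-1)    # (B, E)
    s = F.normalize(text_encoder(text_tokens), dim=-1)  # (B, E)
    logits = m @ s.t() / 0.07                           # temperature
    labels = torch.arange(B, device=logits.device)
    loss_align = 0.5 * (F.cross_entropy(logits, labels) +
                        F.cross_entropy(logits.t(), labels))

    return loss_fm + 0.1 * loss_align  # weight is a placeholder
```

At inference time the learned velocity field would be integrated from noise with a handful of Euler steps, which is consistent with the reported 40$\times$ inference speedup over diffusion baselines that require many denoising iterations.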
Related papers
- Structural Action Transformer for 3D Dexterous Manipulation [80.07649565189035]
Cross-embodiment skill transfer is a challenge for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representations, struggle to capture 3D spatial relations. This paper proposes a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective.
arXiv Detail & Related papers (2026-03-04T11:38:12Z)
- MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction [54.36564144414704]
MeshMimic is an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects.
arXiv Detail & Related papers (2026-02-17T17:09:45Z)
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation [45.753757870577196]
We introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. We show that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses.
arXiv Detail & Related papers (2026-02-04T15:42:58Z)
- VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification [65.15340059997273]
VHOI is a framework for creating realistic human-object interactions in video. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation.
arXiv Detail & Related papers (2025-12-10T13:40:24Z)
- SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios [48.09735396455107]
Hand-Object Interaction (HOI) generation has significant application potential. Current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data. We propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously.
arXiv Detail & Related papers (2025-06-03T05:04:29Z)
- HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception [57.37135310143126]
HOSIG is a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation.
arXiv Detail & Related papers (2025-06-02T12:08:08Z)
- MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation [28.75149480374178]
MEgoHand is a framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and an initial hand pose. It achieves substantial reductions in wrist translation error and joint rotation error, highlighting its capacity to accurately model fine-grained hand joint structures.
arXiv Detail & Related papers (2025-05-22T12:37:47Z)
- SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories [124.24041272390954]
Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems. We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image. We propose SIGHT-Fusion, a novel diffusion-based, image-text-conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database.
arXiv Detail & Related papers (2025-03-28T20:53:20Z)
- ManiTrend: Bridging Future Generation and Action Prediction with 3D Flow for Robotic Manipulation [11.233768932957771]
3D flow represents the motion trend of 3D particles within a scene. ManiTrend is a unified framework that models the dynamics of 3D particles, vision observations, and manipulation actions. Our method achieves state-of-the-art performance with high efficiency.
arXiv Detail & Related papers (2025-02-14T09:13:57Z)
- G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation [65.86819811007157]
We present G3Flow, a novel framework that constructs real-time semantic flow, a dynamic, object-centric 3D representation, by leveraging foundation models. Our approach uniquely combines 3D generative models for digital twin creation, vision foundation models for semantic feature extraction, and robust pose tracking for continuous semantic flow updates. Our results demonstrate the effectiveness of G3Flow in enhancing real-time dynamic semantic feature understanding for robotic manipulation policies.
arXiv Detail & Related papers (2024-11-27T14:17:43Z)