Related papers: iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer

iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer

URL: http://arxiv.org/abs/2506.12847v1
Date: Sun, 15 Jun 2025 13:41:43 GMT
Title: iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer
Authors: Zhelun Shen, Chenming Wu, Junsheng Zhou, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Yingying Li, Haocheng Feng, Wei He, Jingdong Wang,
Abstract summary: This paper presents a novel framework iDiT-HOI that enables in-the-wild HOI reenactment generation.<n> Specifically, we propose a unified inpainting-based token process method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model.
Score: 43.58952721477297
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Digital human video generation is gaining traction in fields like education and e-commerce, driven by advancements in head-body animation and lip-syncing technologies. However, realistic Hand-Object Interaction (HOI) - the complex dynamics between human hands and objects - continues to pose challenges. Generating natural and believable HOI reenactments is difficult due to issues such as occlusion between hands and objects, variations in object shapes and orientations, and the necessity for precise physical interactions, and importantly, the ability to generalize to unseen humans and objects. This paper presents a novel framework iDiT-HOI that enables in-the-wild HOI reenactment generation. Specifically, we propose a unified inpainting-based token process method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model. The first stage generates a key frame by inserting the designated object into the hand region, providing a reference for subsequent frames. The second stage ensures temporal coherence and fluidity in hand-object interactions. The key contribution of our method is to reuse the pretrained model's context perception capabilities without introducing additional parameters, enabling strong generalization to unseen objects and scenarios, and our proposed paradigm naturally supports long video generation. Comprehensive evaluations demonstrate that our approach outperforms existing methods, particularly in challenging real-world scenes, offering enhanced realism and more seamless hand-object interactions.

Related papers

ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning [19.292101162897975]
We introduce ByteLoom, a framework that generates realistic HOI videos with geometrically consistent object illustration.<n>We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain object's geometry consistency.<n>We then design a training curriculum that enhances model capabilities in a progressive style and relaxes the demand of hand mesh.
arXiv Detail & Related papers (2025-12-28T09:38:36Z)
SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation [64.3409486422946]
We present SpriteHand, an autoregressive video generation framework for real-time synthesis of hand-object interaction videos.<n>Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence.<n> Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.
arXiv Detail & Related papers (2025-12-01T18:13:40Z)
Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation [18.328135509017944]
We propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations.<n>This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios.<n>Our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos.
arXiv Detail & Related papers (2025-12-01T13:44:31Z)
Precise Action-to-Video Generation Through Visual Action Prompts [62.951609704196485]
Action-driven video generation faces a precision-generality trade-off.<n>Agent-centric action signals provide precision at the cost of cross-domain transferability.<n>We "render" actions into precise visual prompts as domain-agnostic representations.
arXiv Detail & Related papers (2025-08-18T17:12:28Z)
DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers [30.583932208752877]
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important.<n>We propose a Diffusion Transformer (DiT)-based framework to preserve human identities and product-specific details.<n>We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements.
arXiv Detail & Related papers (2025-06-12T10:58:23Z)
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation.<n>Unlike prior methods that decompose objects, our core is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction.<n>Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories [124.24041272390954]
Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems.<n>We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image.<n>We propose SIGHT-Fusion, a novel diffusion-based image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database.
arXiv Detail & Related papers (2025-03-28T20:53:20Z)
Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model [72.90370736032115]
We present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive layout-instructed Diffusion model (Re-HOLD)<n>Our key insight is to employ specialized layout representation for hands and objects, respectively.<n>To further improve the generation quality of HOI, we design an interactive textural enhancement module for both hands and objects.
arXiv Detail & Related papers (2025-03-21T08:40:35Z)
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description.<n>We propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation.<n>We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z)
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.<n>Our key insight is that large video generation models can act as both neurals and implicit physics simulators'', having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness [57.18183962641015]
We present HOI-Swap, a video editing framework trained in a self-supervised manner. The first stage focuses on object swapping in a single frame with HOI awareness. The second stage extends the single-frame edit across the entire sequence.
arXiv Detail & Related papers (2024-06-11T22:31:29Z)
DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions [15.417836855005087]
We propose a novel method, dubbed DiffH2O, which can synthesize realistic, one or two-handed object interactions.<n>The method introduces three techniques that enable effective learning from limited data.
arXiv Detail & Related papers (2024-03-26T16:06:42Z)
Novel-view Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views [41.50710846018882]
We propose a neural rendering and pose estimation system for hand-object interaction from sparse views. We first learn the shape and appearance prior knowledge of hands and objects separately with the neural representation. During the online stage, we design a rendering-based joint model fitting framework to understand the dynamic hand-object interaction.
arXiv Detail & Related papers (2023-08-22T05:17:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.