DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
- URL: http://arxiv.org/abs/2506.10568v1
- Date: Thu, 12 Jun 2025 10:58:23 GMT
- Title: DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
- Authors: Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wang, Zerong Zheng, Ming Zhou,
- Abstract summary: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important. We propose a Diffusion Transformer (DiT)-based framework to preserve human identities and product-specific details. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements.
- Score: 30.583932208752877
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
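The masked cross-attention mechanism described in the abstract (video tokens attending to paired human and product reference features, with a mask restricting which reference tokens each position may see) can be sketched roughly as follows. This is a minimal NumPy illustration under assumed shapes and names (`masked_cross_attention`, `allow_mask`), not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(video_tokens, ref_tokens, allow_mask):
    """video_tokens: (n_v, d); ref_tokens: (n_r, d) concatenated
    human + product reference features; allow_mask: (n_v, n_r) bool,
    True where a video token may attend to a reference token
    (e.g. hand-region tokens restricted to product references).
    Names and shapes are illustrative assumptions."""
    d = video_tokens.shape[-1]
    scores = video_tokens @ ref_tokens.T / np.sqrt(d)
    scores = np.where(allow_mask, scores, -1e9)   # block disallowed pairs
    weights = softmax(scores, axis=-1)
    return video_tokens + weights @ ref_tokens    # residual injection

# toy usage: 8 video tokens, 4 human refs + 2 product refs
rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))
r = rng.normal(size=(6, 16))
m = np.ones((8, 6), dtype=bool)
m[:4, 4:] = False  # first 4 video tokens see human refs only
out = masked_cross_attention(v, r, m)
print(out.shape)   # (8, 16)
```

Because blocked positions receive near-zero attention weight, perturbing a product reference token leaves the human-only rows of the output unchanged, which is the sense in which the mask separates the two identity streams.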
Related papers
- iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer [43.58952721477297]
This paper presents a novel framework iDiT-HOI that enables in-the-wild HOI reenactment generation. Specifically, we propose a unified inpainting-based token process method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model.
arXiv Detail & Related papers (2025-06-15T13:41:43Z) - SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios [48.09735396455107]
Hand-Object Interaction (HOI) generation has significant application potential. Current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data. We propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously.
arXiv Detail & Related papers (2025-06-03T05:04:29Z) - Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model [72.90370736032115]
We present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representations for hands and objects, respectively. To further improve the generation quality of HOI, we design an interactive textural enhancement module for both hands and objects.
arXiv Detail & Related papers (2025-03-21T08:40:35Z) - BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects [70.20706475051347]
BimArt is a novel generative approach for synthesizing 3D bimanual hand interactions with articulated objects. We first generate distance-based contact maps conditioned on the object trajectory with an articulation-aware feature representation. The learned contact prior is then used to guide our hand motion generator, producing diverse and realistic bimanual motions for object movement and articulation.
arXiv Detail & Related papers (2024-12-06T14:23:56Z) - AnchorCrafter: Animate Cyber-Anchors Selling Your Products via Human-Object Interacting Video Generation [40.81246588724407]
The generation of anchor-style product promotion videos presents promising opportunities in e-commerce, advertising, and consumer engagement. We introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object. We propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives, and the HOI-motion injection, which enables complex human-object interactions.
arXiv Detail & Related papers (2024-11-26T12:42:13Z) - Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony [55.26315526382004]
We propose a novel framework, Combo, for co-speech holistic 3D human motion generation.
In particular, we identify one fundamental challenge as the multiple-input-multiple-output nature of the generative model of interest.
Combo is not only highly effective in generating high-quality motions but also efficient in transferring identity and emotion.
arXiv Detail & Related papers (2024-08-18T07:48:49Z) - VirtualModel: Generating Object-ID-retentive Human-object Interaction Image by Diffusion Model for E-commerce Marketing [20.998016266794952]
Existing works, such as ControlNet [36], T2I-adapter [20], and HumanSD [10], have demonstrated good abilities in generating human images based on pose conditions.
In this paper, we first define a new human image generation task for e-commerce marketing, i.e., Object-ID-retentive Human-object Interaction image Generation (OHG).
We propose a VirtualModel framework to generate human images for product display, supporting any category of product and any type of human-object interaction.
arXiv Detail & Related papers (2024-05-16T11:05:41Z) - Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance [48.986552871497]
We introduce a novel two-stage framework that employs scene affordance as an intermediate representation.
By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals.
Our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE.
arXiv Detail & Related papers (2024-03-26T18:41:07Z) - TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations as in prior work, as well as more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z) - An Identity-Preserved Framework for Human Motion Transfer [3.6286856791379463]
Human motion transfer (HMT) aims to generate a video clip for the target subject by imitating the source subject's motion.
Previous methods have achieved good results in generating high-quality videos, but lose sight of individualized motion information from the source and target motions.
We propose a novel identity-preserved HMT network, termed IDPres.
arXiv Detail & Related papers (2022-04-14T10:27:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.