VirtualModel: Generating Object-ID-retentive Human-object Interaction Image by Diffusion Model for E-commerce Marketing
- URL: http://arxiv.org/abs/2405.09985v1
- Date: Thu, 16 May 2024 11:05:41 GMT
- Authors: Binghui Chen, Chongyang Zhong, Wangmeng Xiang, Yifeng Geng, Xuansong Xie
- Abstract summary: Existing works such as ControlNet [36], T2I-adapter [20] and HumanSD [10] have demonstrated good abilities in generating human images based on pose conditions.
In this paper, we first define a new human image generation task for e-commerce marketing, i.e., Object-ID-retentive Human-object Interaction image Generation (OHG).
We propose a VirtualModel framework to generate human images for product display, which supports any category of product and any type of human-object interaction.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the significant advances in large-scale text-to-image generation by diffusion models (DM), controllable human image generation has been attracting much attention recently. Although existing works such as ControlNet [36], T2I-adapter [20] and HumanSD [10] have demonstrated good abilities in generating human images based on pose conditions, they still fail to meet the requirements of real e-commerce scenarios. These include: (1) the interaction between the shown product and the human should be considered; (2) human parts like the face/hand/arm/foot and the interaction between the human model and the product should be hyper-realistic; and (3) the identity of the product shown in advertising should be exactly consistent with the product itself. To this end, in this paper, we first define a new human image generation task for e-commerce marketing, i.e., Object-ID-retentive Human-object Interaction image Generation (OHG), and then propose a VirtualModel framework to generate human images for product display, which supports any category of product and any type of human-object interaction. As shown in Figure 1, VirtualModel not only outperforms other methods in terms of accurate pose control and image quality but also allows for the display of user-specified product objects by maintaining product-ID consistency and enhancing the plausibility of human-object interaction. Codes and data will be released.
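The conditional diffusion pipelines the abstract refers to (ControlNet, T2I-adapter, HumanSD) all steer sampling by combining conditional and unconditional noise predictions via classifier-free guidance. A purely illustrative, library-free sketch of one guided denoising step is below; the `toy_denoiser` is a hypothetical stand-in for a trained noise-prediction network, not any real model, and the step size and guidance scale are arbitrary.

```python
import numpy as np

def toy_denoiser(x, cond):
    # Hypothetical stand-in for a trained noise-prediction network.
    # With a condition it nudges the sample toward the conditioning
    # signal; without one it predicts a slightly damped copy of x.
    if cond is None:
        return 0.99 * x
    return x - 0.1 * (cond - x)

def cfg_step(x, cond, guidance_scale=7.5, step=0.1):
    """One classifier-free-guidance denoising step: extrapolate from
    the unconditional prediction toward the conditional one, then
    subtract the combined noise estimate from the sample."""
    eps_uncond = toy_denoiser(x, None)
    eps_cond = toy_denoiser(x, cond)
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    return x - step * eps

rng = np.random.default_rng(0)
x = rng.standard_normal(4)   # noisy latent sample
pose = np.ones(4)            # e.g. an encoded pose condition
for _ in range(50):
    x = cfg_step(x, pose)
```

With these toy dynamics the iteration contracts toward a fixed point determined by the conditioning signal, which is the intuition behind pose control: the guidance term pulls every denoising step toward samples consistent with the condition.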
Related papers
- HOComp: Interaction-Aware Human-Object Composition [62.93211305213214]
HOComp is a novel approach for compositing a foreground object onto a human-centric background image.
Experimental results on our dataset show that HOComp effectively generates human-object interactions with consistent appearances.
arXiv Detail & Related papers (2025-07-22T17:59:21Z) - DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers [30.583932208752877]
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important.
We propose a Diffusion Transformer (DiT)-based framework to preserve human identities and product-specific details.
We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements.
arXiv Detail & Related papers (2025-06-12T10:58:23Z) - EVA: Expressive Virtual Avatars from Multi-view Videos [51.33851869426057]
We introduce Expressive Virtual Avatars (EVA), an actor-specific, fully controllable, and expressive human avatar framework.
EVA achieves high-fidelity, lifelike renderings in real time while enabling independent control of facial expressions, body movements, and hand gestures.
This work represents a significant advancement towards fully drivable digital human models.
arXiv Detail & Related papers (2025-05-21T11:22:52Z) - Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors [31.277540988829976]
This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets.
We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images.
arXiv Detail & Related papers (2025-03-25T23:55:47Z) - TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions [33.58559068016724]
We present the first unified model for 3D human-object interaction (HOI).
We generate Human, Object, and Interaction modalities simultaneously with a new three-way diffusion process.
We show the applicability of TriDi to scene population, generating objects for human-contact datasets, and generalization to unseen object geometry.
arXiv Detail & Related papers (2024-12-09T09:35:05Z) - AnchorCrafter: Animate Cyber-Anchors Selling Your Products via Human-Object Interacting Video Generation [40.81246588724407]
The generation of anchor-style product promotion videos presents promising opportunities in e-commerce, advertising, and consumer engagement.
We introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object.
We propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives, and the HOI-motion injection, which enables complex human-object interactions.
arXiv Detail & Related papers (2024-11-26T12:42:13Z) - Single Image, Any Face: Generalisable 3D Face Generation [59.9369171926757]
We propose a novel model, Gen3D-Face, which generates 3D human faces with unconstrained single image input.
To the best of our knowledge, this is the first attempt and benchmark for creating photorealistic 3D human face avatars from single images.
arXiv Detail & Related papers (2024-09-25T14:56:37Z) - Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z) - CapHuman: Capture Your Moments in Parallel Universes [60.06408546134581]
We present a new framework named CapHuman.
CapHuman encodes identity features and then learns to align them into the latent space.
We introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner.
arXiv Detail & Related papers (2024-02-01T14:41:59Z) - Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation [38.08445005326031]
We propose ProciGen to procedurally generate datasets with both plausible interaction and diverse object variation.
We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model).
Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes.
arXiv Detail & Related papers (2023-12-12T08:32:55Z) - Cross-view and Cross-pose Completion for 3D Human Understanding [22.787947086152315]
We propose a pre-training approach based on self-supervised learning that works on human-centric data using only images.
We pre-train a model for body-centric tasks and one for hand-centric tasks.
With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks.
arXiv Detail & Related papers (2023-11-15T16:51:18Z) - HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion [114.15397904945185]
We propose a unified framework, HyperHuman, that generates in-the-wild human images of high realism and diverse layouts.
Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network.
Our framework yields the state-of-the-art performance, generating hyper-realistic human images under diverse scenarios.
arXiv Detail & Related papers (2023-10-12T17:59:34Z) - Hand-Object Interaction Image Generation [135.87707468156057]
This work is dedicated to a new task, i.e., hand-object interaction image generation.
It aims to conditionally generate the hand-object image under the given hand, object and their interaction status.
This task is challenging and research-worthy in many potential application scenarios, such as AR/VR games and online shopping.
arXiv Detail & Related papers (2022-11-28T18:59:57Z) - Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors [42.17542596399014]
We present a method for inferring diverse 3D models of human-object interactions from images.
Our method extracts high-level commonsense knowledge from large language models.
We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset.
arXiv Detail & Related papers (2022-09-06T13:32:55Z) - AvatarGen: a 3D Generative Model for Animatable Human Avatars [108.11137221845352]
AvatarGen is the first method that enables not only non-rigid human generation with diverse appearance but also full control over poses and viewpoints.
To model non-rigid dynamics, it introduces a deformation network to learn pose-dependent deformations in the canonical space.
Our method can generate animatable human avatars with high-quality appearance and geometry modeling, significantly outperforming previous 3D GANs.
arXiv Detail & Related papers (2022-08-01T01:27:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.