Related papers: AnchorCrafter: Animate Cyber-Anchors Selling Your Products via Human-Object Interacting Video Generation

URL: http://arxiv.org/abs/2411.17383v2
Date: Mon, 23 Jun 2025 06:27:21 GMT
Title: AnchorCrafter: Animate Cyber-Anchors Selling Your Products via Human-Object Interacting Video Generation
Authors: Ziyi Xu, Ziyao Huang, Juan Cao, Yong Zhang, Xiaodong Cun, Qing Shuai, Yuchen Wang, Linchao Bao, Jintao Li, Fan Tang
Abstract summary: The generation of anchor-style product promotion videos presents promising opportunities in e-commerce, advertising, and consumer engagement. We introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object. We propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives, and the HOI-motion injection, which enables complex human-object interactions.
Score: 40.81246588724407
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The generation of anchor-style product promotion videos presents promising opportunities in e-commerce, advertising, and consumer engagement. Despite advancements in pose-guided human video generation, creating product promotion videos remains challenging. In addressing this challenge, we identify the integration of human-object interactions (HOI) into pose-guided human video generation as a core issue. To this end, we introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object, achieving high visual fidelity and controllable interactions. Specifically, we propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives and disentangles object and human appearance, and the HOI-motion injection, which enables complex human-object interactions by overcoming challenges in object trajectory conditioning and inter-occlusion management. Extensive experiments show that our system improves object appearance preservation by 7.5% and doubles the object localization accuracy compared to existing state-of-the-art approaches. It also outperforms existing approaches in maintaining human motion consistency and high-quality video generation. Project page including data, code, and Huggingface demo: https://github.com/cangcz/AnchorCrafter.

Related papers

iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer [43.58952721477297]
This paper presents a novel framework, iDiT-HOI, that enables in-the-wild HOI reenactment generation. Specifically, we propose a unified inpainting-based token process method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model.
arXiv Detail & Related papers (2025-06-15T13:41:43Z)
DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers [30.583932208752877]
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important. We propose a Diffusion Transformer (DiT)-based framework to preserve human identities and product-specific details. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements.
arXiv Detail & Related papers (2025-06-12T10:58:23Z)
Multi-identity Human Image Animation with Structural Video Diffusion [64.20452431561436]
We present Structural Video Diffusion, a novel framework for generating realistic multi-human videos. Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals. We expand existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z)
Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model [72.90370736032115]
We present a novel video reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive layout-instructed diffusion model (Re-HOLD). Our key insight is to employ specialized layout representation for hands and objects, respectively. To further improve the generation quality of HOI, we design an interactive textural enhancement module for both hands and objects.
arXiv Detail & Related papers (2025-03-21T08:40:35Z)
AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation [60.5897687447003]
AvatarGO is a novel framework designed to generate realistic 4D HOI scenes from textual inputs. Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling issues. As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.
arXiv Detail & Related papers (2024-10-09T17:58:56Z)
EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
Compositional 3D Human-Object Neural Animation [93.38239238988719]
Human-object interactions (HOIs) are crucial for human-centric scene understanding applications such as human-centric visual generation, AR/VR, and robotics. In this paper, we address this challenge in HOI animation from a compositional perspective. We adopt neural human-object deformation to model and render HOI dynamics based on implicit neural representations.
arXiv Detail & Related papers (2023-04-27T10:04:56Z)
HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video [24.553659249564852]
HOSNeRF reconstructs neural radiance fields for dynamic human-object-scene from a single monocular in-the-wild video. Our method enables pausing the video at any frame and rendering all scene details from arbitrary viewpoints.
arXiv Detail & Related papers (2023-04-24T17:21:49Z)
Learning Object Manipulation Skills from Video via Approximate Differentiable Physics [27.923004421974156]
We teach robots to perform simple object manipulation tasks by watching a single video demonstration. A differentiable scene ensures perceptual fidelity between the 3D scene and the 2D video. We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations.
arXiv Detail & Related papers (2022-08-03T10:21:47Z)
Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z)
Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions [81.88294320397826]
A system does not know what human-object interactions are present in a video or the actual location of the human and object. We introduce a dataset comprising over 6.5k videos with human-object interaction that have been curated from sentence captions. We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
arXiv Detail & Related papers (2021-10-07T15:30:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.