Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router
- URL: http://arxiv.org/abs/2506.19833v1
- Date: Tue, 24 Jun 2025 17:50:16 GMT
- Title: Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router
- Authors: Yubo Huang, Weiqiang Wang, Sirui Zhao, Tong Xu, Lin Liu, Enhong Chen
- Abstract summary: We introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose a novel framework incorporating a fine-grained Embedding Router that binds `who' and `speak what' together to address audio-to-character correspondence control.
- Score: 72.29811385678168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed remarkable advances in audio-driven talking head generation. However, existing approaches predominantly focus on single-character scenarios. While some methods can create separate conversation videos between two individuals, the critical challenge of generating unified conversation videos with multiple physically co-present characters sharing the same spatial environment remains largely unaddressed. This setting presents two key challenges: audio-to-character correspondence control and the lack of suitable datasets featuring multi-character talking videos within the same scene. To address these challenges, we introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose (1) a novel framework incorporating a fine-grained Embedding Router that binds `who' and `speak what' together to address audio-to-character correspondence control; (2) two methods for implementing a 3D-mask embedding router that enables frame-wise, fine-grained control of individual characters, with distinct loss functions based on observed geometric priors and a mask refinement strategy to enhance the accuracy and temporal smoothness of the predicted masks; (3) the first dataset, to the best of our knowledge, specifically constructed for multi-talking-character video generation, accompanied by an open-source data processing pipeline; and (4) a benchmark for dual-talking-character video generation, with extensive experiments demonstrating superior performance over multiple state-of-the-art methods.
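The abstract describes an embedding router that uses per-character, per-frame 3D masks to bind each audio stream to the correct character in the shared latent space. The paper's actual architecture is not specified here, but the core routing idea can be illustrated with a minimal sketch: each character's audio embedding is injected only into the latent positions selected by that character's mask, with soft normalization where masks overlap. All function and variable names below are hypothetical, not taken from the paper.

```python
import numpy as np

def route_audio_embeddings(latents, audio_embs, masks):
    """Hypothetical sketch of 3D-mask-based embedding routing.

    latents:    (T, H, W, D) video latent grid
    audio_embs: (C, D) one audio embedding per character
    masks:      (C, T, H, W) soft per-frame masks in [0, 1]
    Returns latents with each character's audio embedding added
    only inside that character's masked region.
    """
    routed = latents.copy()
    # Normalize so that overlapping masks share influence rather
    # than double-injecting embeddings into the same region.
    total = masks.sum(axis=0, keepdims=True)          # (1, T, H, W)
    weights = masks / np.maximum(total, 1.0)          # (C, T, H, W)
    for c in range(audio_embs.shape[0]):
        # Broadcast the character's embedding over its masked voxels.
        routed += weights[c][..., None] * audio_embs[c]
    return routed
```

In the full model this injection would happen inside cross-attention layers of the MM-DiT backbone, and the masks themselves would be predicted and refined per frame as the abstract describes; the additive broadcast above only conveys the correspondence-control mechanism.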
Related papers
- HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters [14.594698765723756]
HunyuanVideo-Avatar is a model capable of simultaneously generating dynamic, emotion-controllable, multi-character dialogue videos. A character image injection module is designed to replace the conventional addition-based character conditioning scheme. An Audio Emotion Module (AEM) is introduced to extract and transfer emotional cues from an emotion reference image to the target generated video. A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with a latent-level face mask.
arXiv Detail & Related papers (2025-05-26T15:57:27Z) - Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation [62.56037816595509]
Mask$^2$DiT establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. This attention mechanism enables precise segment-level textual-to-visual alignment. Mask$^2$DiT excels at maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description.
arXiv Detail & Related papers (2025-03-25T17:46:50Z) - DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method under MM-DiT architectures. We analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models. Based on our careful design, the videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z) - Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance [29.768141136041454]
We propose a novel multi-character video generation framework based on separated text and pose guidance. Specifically, we first extract character masks from the pose sequence to identify the spatial position of each generated character, and then obtain single prompts for each character with LLMs. Visualized results demonstrate the precise controllability of our method for multi-character generation.
arXiv Detail & Related papers (2024-12-21T05:49:40Z) - DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation [60.07447565026327]
We propose DreamRunner, a novel story-to-video generation method. We structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout and motion planning. DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos.
arXiv Detail & Related papers (2024-11-25T18:41:56Z) - VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos [58.765796160750504]
VideoGLaMM is a new model for fine-grained pixel-level grounding in videos based on user-provided textual inputs. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
arXiv Detail & Related papers (2024-11-07T17:59:27Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process by which people hear speech, extract meaningful cues, and create dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z) - Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach to synthesize a talking person video of arbitrary length using as input: an audio signal and a single unseen image of a person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)