Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router
- URL: http://arxiv.org/abs/2506.19833v1
- Date: Tue, 24 Jun 2025 17:50:16 GMT
- Title: Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router
- Authors: Yubo Huang, Weiqiang Wang, Sirui Zhao, Tong Xu, Lin Liu, Enhong Chen
- Abstract summary: We introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose a novel framework incorporating a fine-grained Embedding Router that binds `who' and `speak what' together to address audio-to-character correspondence control.
- Score: 72.29811385678168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed remarkable advances in audio-driven talking head generation. However, existing approaches predominantly focus on single-character scenarios. While some methods can create separate conversation videos between two individuals, the critical challenge of generating unified conversation videos with multiple physically co-present characters sharing the same spatial environment remains largely unaddressed. This setting presents two key challenges: audio-to-character correspondence control and the lack of suitable datasets featuring multi-character talking videos within the same scene. To address these challenges, we introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose (1) a novel framework incorporating a fine-grained Embedding Router that binds `who' and `speak what' together to address audio-to-character correspondence control; (2) two methods for implementing a 3D-mask embedding router that enables frame-wise, fine-grained control of individual characters, with distinct loss functions based on observed geometric priors and a mask refinement strategy to enhance the accuracy and temporal smoothness of the predicted masks; (3) the first dataset, to the best of our knowledge, specifically constructed for multi-talking-character video generation, accompanied by an open-source data processing pipeline; and (4) a benchmark for dual-talking-character video generation, with extensive experiments demonstrating superior performance over multiple state-of-the-art methods.
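The abstract describes an embedding router that uses per-character, per-frame 3D masks to bind each audio stream to the correct character in the shared latent space. The paper's actual architecture is not specified here, but the core routing idea can be illustrated with a minimal sketch: each character's audio embedding is injected only into the latent positions selected by that character's mask, with soft normalization where masks overlap. All function and variable names below are hypothetical, not taken from the paper.

```python
import numpy as np

def route_audio_embeddings(latents, audio_embs, masks):
    """Hypothetical sketch of 3D-mask-based embedding routing.

    latents:    (T, H, W, D) video latent grid
    audio_embs: (C, D) one audio embedding per character
    masks:      (C, T, H, W) soft per-frame masks in [0, 1]
    Returns latents with each character's audio embedding added
    only inside that character's masked region.
    """
    routed = latents.copy()
    # Normalize so that overlapping masks share influence rather
    # than double-injecting embeddings into the same region.
    total = masks.sum(axis=0, keepdims=True)          # (1, T, H, W)
    weights = masks / np.maximum(total, 1.0)          # (C, T, H, W)
    for c in range(audio_embs.shape[0]):
        # Broadcast the character's embedding over its masked voxels.
        routed += weights[c][..., None] * audio_embs[c]
    return routed
```

In the full model this injection would happen inside cross-attention layers of the MM-DiT backbone, and the masks themselves would be predicted and refined per frame as the abstract describes; the additive broadcast above only conveys the correspondence-control mechanism.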
Related papers
- HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters [14.594698765723756]
HunyuanVideo-Avatar is a model capable of simultaneously generating dynamic, emotion-controllable, multi-character dialogue videos. A character image injection module is designed to replace the conventional addition-based character conditioning scheme. An Audio Emotion Module (AEM) is introduced to extract and transfer emotional cues from an emotion reference image to the target generated video. A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with a latent-level face mask.
arXiv Detail & Related papers (2025-05-26T15:57:27Z) - Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation [62.56037816595509]
Mask$^2$DiT establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. This attention mechanism enables precise segment-level textual-to-visual alignment. Mask$^2$DiT excels at maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description.
arXiv Detail & Related papers (2025-03-25T17:46:50Z) - DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method under MM-DiT architectures. We analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models. Based on our careful design, the videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z) - Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance [29.768141136041454]
We propose a novel multi-character video generation framework based on separated text and pose guidance. Specifically, we first extract character masks from the pose sequence to identify the spatial position of each generated character, and then obtain single prompts for each character with LLMs. Visualized results demonstrate the precise controllability of our method for multi-character generation.
arXiv Detail & Related papers (2024-12-21T05:49:40Z) - DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation [60.07447565026327]
We propose DreamRunner, a novel story-to-video generation method. We structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout and motion planning. DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos.
arXiv Detail & Related papers (2024-11-25T18:41:56Z) - VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos [58.765796160750504]
VideoGLaMM is a new model for fine-grained pixel-level grounding in videos based on user-provided textual inputs. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
arXiv Detail & Related papers (2024-11-07T17:59:27Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process by which people hear speech, extract meaningful cues, and create dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z) - Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach to synthesize a talking person video of arbitrary length using as input: an audio signal and a single unseen image of a person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)