AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
- URL: http://arxiv.org/abs/2511.23475v1
- Date: Fri, 28 Nov 2025 18:59:01 GMT
- Title: AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
- Authors: Zhizhou Zhong, Yicheng Ji, Zhe Kong, Yiying Liu, Jiarui Wang, Jiasun Feng, Lupeng Liu, Xiangyi Wang, Yanjia Li, Yuqing She, Ying Qin, Huan Li, Shuiyang Mao, Wei Liu, Wenhan Luo,
- Abstract summary: We propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. We extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips.
- Score: 30.435102560798455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Moreover, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.
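The abstract describes the mechanism only at a high level. As a rough illustration of what "iteratively processing identity-audio pairs" inside a Diffusion Transformer attention block might look like, here is a minimal PyTorch-style sketch; the module name, the per-identity region masks, and the tensor shapes are assumptions made for illustration, not the authors' implementation.

```python
# Speculative sketch (not the authors' code): cross-attend video tokens to each
# identity's audio features in turn, so adding another speaker only adds another
# loop iteration rather than new parameters.
import torch
import torch.nn as nn

class IdentityAwareAttention(nn.Module):
    """Assumed interface: video_tokens is (B, N, D); each identity contributes
    audio features (B, M, D) and a boolean mask (B, N) marking the spatial
    tokens belonging to that identity's region."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_feats_per_id, id_masks):
        out = video_tokens
        # Iterate over identity-audio pairs; the loop length is arbitrary,
        # which is what would allow scaling to any number of drivable identities.
        for audio_feats, id_mask in zip(audio_feats_per_id, id_masks):
            attended, _ = self.attn(self.norm(out), audio_feats, audio_feats)
            # Only update the tokens belonging to this identity's region.
            out = torch.where(id_mask.unsqueeze(-1), out + attended, out)
        return out


if __name__ == "__main__":
    B, N, M, D = 2, 64, 16, 128
    block = IdentityAwareAttention(dim=D)
    video = torch.randn(B, N, D)
    audios = [torch.randn(B, M, D) for _ in range(3)]   # three hypothetical speakers
    masks = [torch.rand(B, N) > 0.5 for _ in range(3)]  # hypothetical per-identity masks
    print(block(video, audios, masks).shape)            # torch.Size([2, 64, 128])
```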
Related papers
- EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans [86.21111833841684]
We present THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset. We analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. We introduce EvalTalker, a novel TH quality assessment framework.
arXiv Detail & Related papers (2025-12-01T06:56:40Z) - PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards [86.1965460124838]
We propose a scalable multi-subject data generation pipeline. We first enable single-subject personalization models to acquire knowledge of multi-image and multi-subject scenarios. To enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards.
arXiv Detail & Related papers (2025-12-01T03:25:49Z) - TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation [76.48551690189406]
We present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation.
arXiv Detail & Related papers (2025-10-08T17:16:09Z) - BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios. We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z) - Multi-human Interactive Talking Dataset [20.920129008402718]
We introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. The resulting dataset comprises 12 hours of high-resolution footage, with each clip featuring two to four speakers. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors.
arXiv Detail & Related papers (2025-08-05T03:54:18Z) - Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation [34.15566431966277]
We propose a novel task: Multi-Person Conversational Video Generation. We introduce a new framework, MultiTalk, to address the challenges during multi-person generation.
arXiv Detail & Related papers (2025-05-28T17:57:06Z) - Multi-identity Human Image Animation with Structural Video Diffusion [73.38728096088732]
Structural Video Diffusion is a novel framework for generating realistic multi-human videos. Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals. We expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z) - Towards Open Domain Text-Driven Synthesis of Multi-Person Motions [36.737740727883924]
We curate human pose and motion datasets by estimating pose information from large-scale image and video datasets.
Our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.
arXiv Detail & Related papers (2024-05-28T18:00:06Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio input.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - Audio-Visual Fusion Layers for Event Type Aware Video Recognition [86.22811405685681]
We propose a new model to address the multisensory integration problem with individual event-specific layers in a multi-task learning scheme.
We show that our network, though formulated with single labels, can output additional true multi-labels to represent the given videos.
arXiv Detail & Related papers (2022-02-12T02:56:22Z)