EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans
- URL: http://arxiv.org/abs/2512.01340v1
- Date: Mon, 01 Dec 2025 06:56:40 GMT
- Title: EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans
- Authors: Yingjie Zhou, Xilei Zhu, Siyu Ren, Ziyi Zhao, Ziwen Wang, Farong Wen, Yu Zhou, Jiezhang Cao, Xiongkuo Min, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu
- Abstract summary: We present THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset. We analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. We introduce EvalTalker, a novel TH quality assessment framework.
- Score: 86.21111833841684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech-driven Talking Human (TH) generation, commonly known as "Talker," currently faces limitations in multi-subject driving capabilities. Extending this paradigm to "Multi-Talker," capable of animating multiple subjects simultaneously, introduces richer interactivity and stronger immersion in audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) from 15 representative Multi-Talkers using 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. Furthermore, we introduce EvalTalker, a novel TH quality assessment framework. This framework possesses the ability to perceive global quality, human characteristics, and identity consistency, while integrating Qwen-Sync to perceive multimodal synchrony. Experimental results demonstrate that EvalTalker achieves superior correlation with subjective scores, providing a robust foundation for future research on high-quality Multi-Talker generation and evaluation.
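The abstract sketches EvalTalker as a set of perception branches (global quality, human characteristics, identity consistency) fused with a multimodal synchrony signal from Qwen-Sync, but does not specify the architecture further. The following is a minimal PyTorch sketch of one plausible multi-branch design under those stated assumptions; every module name, dimension, and the mocked synchrony score are illustrative, not the authors' implementation.

```python
# Hypothetical multi-branch talking-human quality predictor in the spirit of
# the EvalTalker description. All module names, dimensions, and the mocked
# synchrony score are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class Branch(nn.Module):
    """One perception branch: pooled features -> quality embedding."""

    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


class MultiBranchQualityModel(nn.Module):
    def __init__(self, feat_dim: int = 512, hid_dim: int = 128):
        super().__init__()
        self.global_branch = Branch(feat_dim, hid_dim)         # global quality
        self.human_branch = Branch(feat_dim, hid_dim)          # human characteristics
        self.identity_branch = Branch(2 * feat_dim, hid_dim)   # portrait vs. video
        # Fuse the three embeddings with a scalar synchrony score.
        self.head = nn.Sequential(nn.Linear(3 * hid_dim + 1, hid_dim),
                                  nn.ReLU(), nn.Linear(hid_dim, 1))

    def forward(self, video_feat, portrait_feat, sync_score):
        g = self.global_branch(video_feat)
        h = self.human_branch(video_feat)
        # Identity consistency compares the driving portrait with the result.
        i = self.identity_branch(torch.cat([video_feat, portrait_feat], dim=-1))
        fused = torch.cat([g, h, i, sync_score.unsqueeze(-1)], dim=-1)
        return self.head(fused).squeeze(-1)  # predicted mean opinion score


if __name__ == "__main__":
    model = MultiBranchQualityModel()
    video_feat = torch.randn(4, 512)     # e.g. pooled video-encoder features
    portrait_feat = torch.randn(4, 512)  # features of the source portrait
    sync_score = torch.rand(4)           # stand-in for a Qwen-Sync output
    print(model(video_feat, portrait_feat, sync_score).shape)  # torch.Size([4])
```

Feeding the synchrony signal in as a scalar side input (rather than a fourth learned branch) is just one way to read "integrating Qwen-Sync"; the paper may fuse it differently.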
Related papers
- MPCEval: A Benchmark for Multi-Party Conversation Generation [23.227067535888768]
We introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker-content consistency. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations.
arXiv Detail & Related papers (2026-03-05T09:07:00Z)
- Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction [12.216811577733125]
We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. We introduce a new axis, Voice Editing, that tests robustness to mid-utterance speech repairs and backtracking. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline.
arXiv Detail & Related papers (2025-12-16T19:26:44Z)
- PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards [86.1965460124838]
We propose a scalable multi-subject data generation pipeline. We first enable single-subject personalization models to acquire knowledge of multi-image and multi-subject scenarios. To enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards.
arXiv Detail & Related papers (2025-12-01T03:25:49Z)
- AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement [30.435102560798455]
We propose AnyTalker, a multi-person generation framework that features a multi-stream processing architecture. We extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs (see the attention sketch after this list). Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips.
arXiv Detail & Related papers (2025-11-28T18:59:01Z)
- VOGUE: A Multimodal Dataset for Conversational Recommendation in Fashion [18.017186369021154]
VOGUE is a novel dataset of 60 human-human dialogues in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles and histories, and post-conversation ratings from both Seekers and Assistants. Our initial analyses of VOGUE reveal distinctive dynamics of visually grounded dialogue.
arXiv Detail & Related papers (2025-10-24T04:45:29Z)
- TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis [74.31705485094096]
We introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1,244 hours of video from 7,729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail (a pipeline skeleton follows this list). We construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes.
arXiv Detail & Related papers (2025-08-19T08:31:15Z)
- Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads [53.012111671763776]
Speech-driven methods for portraits are figuratively known as "Talkers" because of their capability to synthesize speaking mouth shapes and facial movements. With the rapid development of Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging digital human medium. This paper presents THQA-10K, the largest AGTH quality assessment dataset to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts.
arXiv Detail & Related papers (2025-07-31T08:43:21Z)
- MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions [70.93364531054273]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues. Specifically, MultiVox includes 1,000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features. Our evaluation of 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z)
- Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation [34.15566431966277]
We propose a novel task: Multi-Person Conversational Video Generation. We introduce a new framework, MultiTalk, to address the challenges of multi-person generation.
arXiv Detail & Related papers (2025-05-28T17:57:06Z)
- OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions [62.19092662469285]
Online Multimodal Conversational Response Generation (OMCRG) is a novel task designed to produce synchronized verbal and non-verbal listener feedback online. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. We offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors.
arXiv Detail & Related papers (2025-05-27T20:12:46Z)
- Multimodal Conversation Structure Understanding [12.29827265137757]
Large language models' ability to understand fine-grained conversational structure remains underexplored. We present a human-annotated dataset with 4,398 speaker and reply-to relation annotations, 5,755 addressee annotations, and 3,142 side-participant annotations. We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging.
arXiv Detail & Related papers (2025-05-23T06:41:54Z)
- X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents [56.64615470513102]
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern.
arXiv Detail & Related papers (2024-08-19T09:57:28Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
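The AnyTalker entry above describes extending a Diffusion Transformer attention block with identity-aware attention that iteratively processes identity-audio pairs. As a hedged illustration of that routing idea (not the paper's actual block), the sketch below lets each identity's video tokens cross-attend only to that identity's own audio stream; all shapes, masks, and names are assumptions.

```python
# Plausible reading of "identity-aware attention": audio from one speaker
# cannot drive another speaker's tokens. Not the block from the AnyTalker
# paper; shapes and the masking scheme are illustrative assumptions.
import torch
import torch.nn as nn


class IdentityAwareAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens, identity_masks):
        # video_tokens: (B, Nv, D)
        # audio_tokens: list of (B, Na_k, D) tensors, one per identity
        # identity_masks: list of (Nv,) boolean masks selecting each
        #                 identity's video tokens
        out = video_tokens.clone()
        for audio_k, mask_k in zip(audio_tokens, identity_masks):
            q = video_tokens[:, mask_k, :]        # this identity's tokens
            attended, _ = self.attn(q, audio_k, audio_k)
            out[:, mask_k, :] = attended          # write back in place
        return out


if __name__ == "__main__":
    B, Nv, D = 2, 10, 256
    block = IdentityAwareAttention(D)
    video = torch.randn(B, Nv, D)
    masks = [torch.arange(Nv) < 5, torch.arange(Nv) >= 5]   # two identities
    audios = [torch.randn(B, 7, D), torch.randn(B, 9, D)]
    print(block(video, audios, masks).shape)  # torch.Size([2, 10, 256])
```

Iterating over identity-audio pairs, as here, matches the summary's wording and scales to any number of speakers without retraining the block itself.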
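TalkVid's multi-stage automated curation, per its summary above, filters clips for motion stability, aesthetic quality, and facial detail. The skeleton below shows the staged-filter pattern such a pipeline implies; the per-clip scores are assumed to be precomputed and every threshold is a placeholder, since the paper's actual criteria are not given here.

```python
# Staged curation skeleton: clips must pass every filter stage in order.
# Scores are assumed precomputed; thresholds are placeholders, not TalkVid's.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Clip:
    path: str
    motion_stability: float  # assumed precomputed scores in [0, 1]
    aesthetic: float
    face_detail: float


STAGES: list[tuple[str, Callable[[Clip], bool]]] = [
    ("motion stability", lambda c: c.motion_stability >= 0.7),
    ("aesthetic quality", lambda c: c.aesthetic >= 0.5),
    ("facial detail", lambda c: c.face_detail >= 0.6),
]


def curate(clips: Iterable[Clip]) -> list[Clip]:
    kept = list(clips)
    for name, keep in STAGES:
        before = len(kept)
        kept = [c for c in kept if keep(c)]
        print(f"{name}: kept {len(kept)}/{before}")  # per-stage attrition
    return kept


if __name__ == "__main__":
    pool = [Clip("a.mp4", 0.9, 0.8, 0.7), Clip("b.mp4", 0.4, 0.9, 0.9)]
    print([c.path for c in curate(pool)])  # ['a.mp4']
```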