EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans
- URL: http://arxiv.org/abs/2512.01340v1
- Date: Mon, 01 Dec 2025 06:56:40 GMT
- Title: EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans
- Authors: Yingjie Zhou, Xilei Zhu, Siyu Ren, Ziyi Zhao, Ziwen Wang, Farong Wen, Yu Zhou, Jiezhang Cao, Xiongkuo Min, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu
- Abstract summary: We present THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset. We analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. We introduce EvalTalker, a novel TH quality assessment framework.
- Score: 86.21111833841684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech-driven Talking Human (TH) generation, commonly known as "Talker," currently faces limitations in multi-subject driving capabilities. Extending this paradigm to "Multi-Talker," capable of animating multiple subjects simultaneously, introduces richer interactivity and stronger immersion in audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) from 15 representative Multi-Talkers using 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. Furthermore, we introduce EvalTalker, a novel TH quality assessment framework. This framework possesses the ability to perceive global quality, human characteristics, and identity consistency, while integrating Qwen-Sync to perceive multimodal synchrony. Experimental results demonstrate that EvalTalker achieves superior correlation with subjective scores, providing a robust foundation for future research on high-quality Multi-Talker generation and evaluation.
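The abstract sketches EvalTalker as a set of perception branches (global quality, human characteristics, identity consistency) fused with a multimodal synchrony signal from Qwen-Sync, but does not specify the architecture further. The following is a minimal PyTorch sketch of one plausible multi-branch design under those stated assumptions; every module name, dimension, and the mocked synchrony score are illustrative, not the authors' implementation.

```python
# Hypothetical multi-branch talking-human quality predictor in the spirit of
# the EvalTalker description. All module names, dimensions, and the mocked
# synchrony score are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class Branch(nn.Module):
    """One perception branch: pooled features -> quality embedding."""

    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


class MultiBranchQualityModel(nn.Module):
    def __init__(self, feat_dim: int = 512, hid_dim: int = 128):
        super().__init__()
        self.global_branch = Branch(feat_dim, hid_dim)         # global quality
        self.human_branch = Branch(feat_dim, hid_dim)          # human characteristics
        self.identity_branch = Branch(2 * feat_dim, hid_dim)   # portrait vs. video
        # Fuse the three embeddings with a scalar synchrony score.
        self.head = nn.Sequential(nn.Linear(3 * hid_dim + 1, hid_dim),
                                  nn.ReLU(), nn.Linear(hid_dim, 1))

    def forward(self, video_feat, portrait_feat, sync_score):
        g = self.global_branch(video_feat)
        h = self.human_branch(video_feat)
        # Identity consistency compares the driving portrait with the result.
        i = self.identity_branch(torch.cat([video_feat, portrait_feat], dim=-1))
        fused = torch.cat([g, h, i, sync_score.unsqueeze(-1)], dim=-1)
        return self.head(fused).squeeze(-1)  # predicted mean opinion score


if __name__ == "__main__":
    model = MultiBranchQualityModel()
    video_feat = torch.randn(4, 512)     # e.g. pooled video-encoder features
    portrait_feat = torch.randn(4, 512)  # features of the source portrait
    sync_score = torch.rand(4)           # stand-in for a Qwen-Sync output
    print(model(video_feat, portrait_feat, sync_score).shape)  # torch.Size([4])
```

Feeding the synchrony signal in as a scalar side input (rather than a fourth learned branch) is just one way to read "integrating Qwen-Sync"; the paper may fuse it differently.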
Related papers
- MPCEval: A Benchmark for Multi-Party Conversation Generation [23.227067535888768]
We introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker-content consistency. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations.
arXiv Detail & Related papers (2026-03-05T09:07:00Z)
- Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction [12.216811577733125]
We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. We introduce a new axis, Voice Editing, that tests robustness to mid-utterance speech repairs and backtracking. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline.
arXiv Detail & Related papers (2025-12-16T19:26:44Z)
- PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards [86.1965460124838]
We propose a scalable multi-subject data generation pipeline. We first enable single-subject personalization models to acquire knowledge of multi-image and multi-subject scenarios. To enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards.
arXiv Detail & Related papers (2025-12-01T03:25:49Z)
- AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement [30.435102560798455]
We propose AnyTalker, a multi-person generation framework that features a multi-stream processing architecture. We extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs (see the attention sketch after this list). Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips.
arXiv Detail & Related papers (2025-11-28T18:59:01Z)
- VOGUE: A Multimodal Dataset for Conversational Recommendation in Fashion [18.017186369021154]
VOGUE is a novel dataset of 60 human-human dialogues in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles and histories, and post-conversation ratings from both Seekers and Assistants. Our initial analyses of VOGUE reveal distinctive dynamics of visually grounded dialogue.
arXiv Detail & Related papers (2025-10-24T04:45:29Z)
- TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis [74.31705485094096]
We introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1,244 hours of video from 7,729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail (a pipeline skeleton follows this list). We construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes.
arXiv Detail & Related papers (2025-08-19T08:31:15Z)
- Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads [53.012111671763776]
Speech-driven methods for portraits are figuratively known as "Talkers" because of their capability to synthesize speaking mouth shapes and facial movements. With the rapid development of Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging digital human medium. This paper presents THQA-10K, the largest AGTH quality assessment dataset to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts.
arXiv Detail & Related papers (2025-07-31T08:43:21Z)
- MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions [70.93364531054273]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues. Specifically, MultiVox includes 1,000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features. Our evaluation of 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z)
- Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation [34.15566431966277]
We propose a novel task: Multi-Person Conversational Video Generation. We introduce a new framework, MultiTalk, to address the challenges of multi-person generation.
arXiv Detail & Related papers (2025-05-28T17:57:06Z)
- OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions [62.19092662469285]
Online Multimodal Conversational Response Generation (OMCRG) is a novel task designed to produce synchronized verbal and non-verbal listener feedback online. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. We offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors.
arXiv Detail & Related papers (2025-05-27T20:12:46Z)
- Multimodal Conversation Structure Understanding [12.29827265137757]
Large language models' ability to understand fine-grained conversational structure remains underexplored. We present a human-annotated dataset with 4,398 speaker and reply-to relation annotations, 5,755 addressee annotations, and 3,142 side-participant annotations. We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging.
arXiv Detail & Related papers (2025-05-23T06:41:54Z)
- X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents [56.64615470513102]
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern.
arXiv Detail & Related papers (2024-08-19T09:57:28Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
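The AnyTalker entry above describes extending a Diffusion Transformer attention block with identity-aware attention that iteratively processes identity-audio pairs. As a hedged illustration of that routing idea (not the paper's actual block), the sketch below lets each identity's video tokens cross-attend only to that identity's own audio stream; all shapes, masks, and names are assumptions.

```python
# Plausible reading of "identity-aware attention": audio from one speaker
# cannot drive another speaker's tokens. Not the block from the AnyTalker
# paper; shapes and the masking scheme are illustrative assumptions.
import torch
import torch.nn as nn


class IdentityAwareAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens, identity_masks):
        # video_tokens: (B, Nv, D)
        # audio_tokens: list of (B, Na_k, D) tensors, one per identity
        # identity_masks: list of (Nv,) boolean masks selecting each
        #                 identity's video tokens
        out = video_tokens.clone()
        for audio_k, mask_k in zip(audio_tokens, identity_masks):
            q = video_tokens[:, mask_k, :]        # this identity's tokens
            attended, _ = self.attn(q, audio_k, audio_k)
            out[:, mask_k, :] = attended          # write back in place
        return out


if __name__ == "__main__":
    B, Nv, D = 2, 10, 256
    block = IdentityAwareAttention(D)
    video = torch.randn(B, Nv, D)
    masks = [torch.arange(Nv) < 5, torch.arange(Nv) >= 5]   # two identities
    audios = [torch.randn(B, 7, D), torch.randn(B, 9, D)]
    print(block(video, audios, masks).shape)  # torch.Size([2, 10, 256])
```

Iterating over identity-audio pairs, as here, matches the summary's wording and scales to any number of speakers without retraining the block itself.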
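TalkVid's multi-stage automated curation, per its summary above, filters clips for motion stability, aesthetic quality, and facial detail. The skeleton below shows the staged-filter pattern such a pipeline implies; the per-clip scores are assumed to be precomputed and every threshold is a placeholder, since the paper's actual criteria are not given here.

```python
# Staged curation skeleton: clips must pass every filter stage in order.
# Scores are assumed precomputed; thresholds are placeholders, not TalkVid's.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Clip:
    path: str
    motion_stability: float  # assumed precomputed scores in [0, 1]
    aesthetic: float
    face_detail: float


STAGES: list[tuple[str, Callable[[Clip], bool]]] = [
    ("motion stability", lambda c: c.motion_stability >= 0.7),
    ("aesthetic quality", lambda c: c.aesthetic >= 0.5),
    ("facial detail", lambda c: c.face_detail >= 0.6),
]


def curate(clips: Iterable[Clip]) -> list[Clip]:
    kept = list(clips)
    for name, keep in STAGES:
        before = len(kept)
        kept = [c for c in kept if keep(c)]
        print(f"{name}: kept {len(kept)}/{before}")  # per-stage attrition
    return kept


if __name__ == "__main__":
    pool = [Clip("a.mp4", 0.9, 0.8, 0.7), Clip("b.mp4", 0.4, 0.9, 0.9)]
    print([c.path for c in curate(pool)])  # ['a.mp4']
```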