TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
- URL: http://arxiv.org/abs/2508.13618v1
- Date: Tue, 19 Aug 2025 08:31:15 GMT
- Title: TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
- Authors: Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang
- Abstract summary: We introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail. We construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes.
- Score: 74.31705485094096
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found at https://github.com/FreedomIntelligence/TalkVid
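The abstract names the pipeline's three filter axes (motion stability, aesthetic quality, facial detail) without detailing them; below is a minimal sketch of what such sequential, threshold-based curation might look like. All field names, scoring sources, and cutoff values are hypothetical assumptions for illustration, not the authors' implementation (see the linked repository for the actual pipeline).

```python
# Hypothetical sketch of a multi-stage clip-filtering pipeline; scores and
# thresholds are placeholders, not TalkVid's real values.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Clip:
    path: str
    motion_stability: float   # hypothetical: e.g., 1 - normalized optical-flow jitter
    aesthetic_score: float    # hypothetical: e.g., output of a pretrained aesthetic predictor
    face_detail_score: float  # hypothetical: e.g., sharpness inside the detected face box

# Hypothetical cutoffs; the paper validates its actual filters against human judgments.
STAGES: List[Callable[[Clip], bool]] = [
    lambda c: c.motion_stability >= 0.8,   # stage 1: drop shaky footage
    lambda c: c.aesthetic_score >= 5.0,    # stage 2: drop low-aesthetic clips
    lambda c: c.face_detail_score >= 0.6,  # stage 3: drop blurry faces
]

def curate(clips: List[Clip]) -> List[Clip]:
    """Apply each filter stage in sequence; a clip must pass all stages to survive."""
    kept = clips
    for stage in STAGES:
        kept = [c for c in kept if stage(c)]
    return kept

if __name__ == "__main__":
    demo = [
        Clip("a.mp4", motion_stability=0.9, aesthetic_score=6.1, face_detail_score=0.7),
        Clip("b.mp4", motion_stability=0.5, aesthetic_score=6.5, face_detail_score=0.9),
    ]
    print([c.path for c in curate(demo)])  # -> ['a.mp4']; 'b.mp4' fails stage 1
```

Staging the filters sequentially mirrors the abstract's "multi-stage" framing: each stage only scores the survivors of the previous one, which is convenient when later checks are more expensive to compute.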
Related papers
- JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1 [6.4645943969421875]
This paper introduces the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1). It comprises a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance.
arXiv Detail & Related papers (2025-07-28T16:47:44Z) - SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation [45.27083162088965]
SpeakerVid-5M is the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits.
arXiv Detail & Related papers (2025-07-14T02:22:47Z) - UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios [22.15198429228792]
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection. UniTalk focuses explicitly on diverse and difficult real-world conditions. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities.
arXiv Detail & Related papers (2025-05-28T04:08:59Z) - Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation [9.286959744769792]
Cross-lingual generalization of objective speech quality models is a major challenge. Models trained primarily on English data may struggle to generalize to languages with different phonetic, tonal, and prosodic characteristics. This study investigates the cross-lingual performance of two speech quality models: NISQA, a CNN-based model, and the Transformer-based Audio Spectrogram Transformer (AST).
arXiv Detail & Related papers (2025-02-18T16:22:43Z) - Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts use synthetic annotations to refine large-scale, diverse ASR datasets compromised by low quality.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z) - SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [70.08842857515141]
SpokenWOZ is a large-scale speech-text dataset for spoken task-oriented dialogue (TOD). Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event
Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - ASR4REAL: An extended benchmark for speech models [19.348785785921446]
We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models.
We find that even though recent models do not seem to exhibit a gender bias, they usually show significant performance discrepancies across accents.
All tested models show a strong performance drop when tested on conversational speech.
arXiv Detail & Related papers (2021-10-16T14:34:25Z) - UniCon: Unified Context Network for Robust Active Speaker Detection [111.90529347692723]
We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD).
Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information.
A thorough ablation study is performed on several challenging ASD benchmarks under different settings.
arXiv Detail & Related papers (2021-08-05T13:25:44Z) - Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z) - The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset:
Collection, Insights and Improvements [14.707930573950787]
We present MuSe-CaR, a first-of-its-kind multimodal dataset.
The data is publicly available, as it recently served as the test bed for the 1st Multimodal Sentiment Analysis Challenge.
arXiv Detail & Related papers (2021-01-15T10:40:37Z)