Fugu-MT Paper Translation (Summary): TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

Paper summary: TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

arxiv url: http://arxiv.org/abs/2508.13618v1
Date: Tue, 19 Aug 2025 08:31:15 GMT
Status: Translation complete
System update date: 2025-08-20 15:36:31.84754
Title: TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
Authors: Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang
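Abstract summary: We introduce TalkVid, a large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail. TalkVid-Bench is a stratified evaluation set of 500 clips carefully balanced across key demographic and linguistic axes.
Reference score (independently computed attention score): 74.31705485094096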
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found in https://github.com/FreedomIntelligence/TalkVid
