Fugu-MT Paper Translation (Abstract): CapTalk: Text-Guided Stylization and Speech-Driven 3D Head Animation

Paper overview: CapTalk: Text-Guided Stylization and Speech-Driven 3D Head Animation

arxiv url: http://arxiv.org/abs/2605.29316v1
Date: Thu, 28 May 2026 03:46:11 GMT
Status: Translation complete
System update date: 2026-05-30 02:45:55.646095
Title: CapTalk: Text-Guided Stylization and Speech-Driven 3D Head Animation
Authors: Xuangeng Chu, Yuan Gan, Ziteng Cui, Shuhong Liu, Jian Wang, Bing Zhou, Tatsuya Harada
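Abstract summary: We build a model that takes as input both textual descriptions of speaking style and character emotion and the driving audio stream. The model supports dynamic emotion control during inference and can handle scenarios where the target emotion changes throughout the speech.
Reference score (independently computed attention metric): 44.15338767557179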
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio-driven 3D facial animation aims to generate synchronized lip movements and vivid facial expressions from arbitrary audio clips. While existing methods can produce synchronized lip motions, they often rely on predefined identity or style latent features, which limits users' ability to freely control speaking styles. Moreover, applying a fixed style or identity to an entire audio segment typically results in facial animation styles that do not adapt to the emotional content of the audio. To address these challenges, we revisit the entanglement between style and emotion, construct a large-scale dataset with textual descriptions of both style and emotion, and propose a novel talking head generation framework that enables separate control over style and emotion. Our model takes as input both textual descriptions of speaking style and character emotion, as well as the driving audio stream, enabling real-time generation of highly synchronized lip movements and facial expressions that match the provided descriptions. Furthermore, our model supports dynamic emotion control during inference, allowing it to handle scenarios where the target emotion changes throughout the speech.
