Fugu-MT Paper Translation (Abstract): MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation

Paper overview: MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation

arxiv url: http://arxiv.org/abs/2605.08050v1
Date: Fri, 08 May 2026 17:40:01 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:39.246768
Title: MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation
Title (reference translation): MoCoTalk: Multi-Conditional Diffusion with an Adaptive Router for Controllable Talking Head Generation
Authors: Xinyan Ye, Jiankang Deng, Abbas Edalat
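Abstract summary: MoCoTalk is a multi-conditional video diffusion framework that unifies four complementary control signals. The Adaptive Multi-Condition Router computes channel-wise, timestep-aware gating over the four condition streams. The Mouth-Augmented Shading Mesh is a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting.
Reference score (independently computed attention score): 45.88028371034407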
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.
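Abstract (reference translation): Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors and, when multiple conditions are involved, rely on fixed-weight or heuristic fusion. The proposed MoCoTalk is a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, the authors introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, so that the fusion strategy can vary with both feature subspace and noise level. To better capture speech-related facial dynamics, they design the Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. They further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.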
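
The abstract describes the Adaptive Multi-Condition Router only at a high level, so the following is a minimal sketch of what channel-wise, timestep-aware gating over several condition streams could look like, not the authors' implementation. The function names (`timestep_embedding`, `route_conditions`), the sinusoidal timestep embedding, the linear gate projection `W`, `b`, and the softmax normalization across conditions are all assumptions made for illustration.

```python
import math

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (assumed; dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def route_conditions(streams, t, W, b):
    """Fuse K condition streams with per-channel gates that depend on t.

    streams: K lists of C floats, one feature vector per condition
    t:       scalar diffusion timestep (stands in for the noise level)
    W, b:    hypothetical gate projection, W is (K*C) x D, b has K*C entries
    Returns (fused, gates); for each channel the K gates sum to 1.
    """
    K, C = len(streams), len(streams[0])
    emb = timestep_embedding(t, len(W[0]))
    # One logit per (condition, channel) pair, computed from the timestep only.
    logits = [
        [sum(w * e for w, e in zip(W[k * C + c], emb)) + b[k * C + c]
         for c in range(C)]
        for k in range(K)
    ]
    gates = [[0.0] * C for _ in range(K)]
    fused = [0.0] * C
    for c in range(C):
        # Numerically stable softmax across the K conditions (assumed choice).
        m = max(logits[k][c] for k in range(K))
        exps = [math.exp(logits[k][c] - m) for k in range(K)]
        z = sum(exps)
        for k in range(K):
            gates[k][c] = exps[k] / z
            fused[c] += gates[k][c] * streams[k][c]
    return fused, gates
```

In this sketch the four streams would stand for features derived from the reference image, facial keypoints, shading meshes, and audio. Because the gates are indexed by channel and recomputed for each timestep, the weighting of the conditions can differ across feature channels and across noise levels, in contrast with a single fixed fusion weight per condition.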


Related papers