Fugu-MT Paper Translation (Abstract): InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance

Paper overview: InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance

arxiv url: http://arxiv.org/abs/2603.23132v1
Date: Tue, 24 Mar 2026 12:27:52 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.47514
Title: InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance
Title (reference translation): InterDyad: Interactive Dyadic Speech-to-Video Generation with Intermediate Visual Guidance
Authors: Dongwei Pan, Longwei Guo, Jiazhi Guan, Luying Huang, Yiding Li, Haojie Liu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou
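Abstract summary: We propose InterDyad, a framework for synthesizing interactive dyadic dynamics. We first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Leveraging a Multimodal Large Language Model (MLLM), the framework distills linguistic intent from audio to determine the precise timing and appropriateness of reactions. Comprehensive experiments show that InterDyad significantly outperforms state-of-the-art methods in generating natural and contextually grounded two-person interactions.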
Reference score (independently computed attention score): 20.740979380270126
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To further improve lip-sync quality under extreme head poses, we propose Role-aware Dyadic Gaussian Guidance (RoDG) for enhanced lip-synchronization and spatial consistency. Finally, we introduce a dedicated evaluation suite with novelly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Please refer to our project page for demo videos: https://interdyad.github.io/.
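Abstract (reference translation): Despite progress in speech-to-video synthesis, existing methods often struggle to capture dependencies between individuals and to provide fine-grained control over reactive behavior in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis by querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. We then introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. Leveraging a Multimodal Large Language Model (MLLM), the framework distills linguistic intent from audio to determine the precise timing and appropriateness of reactions. To improve lip-sync quality under extreme head poses, we propose Role-aware Dyadic Gaussian Guidance (RoDG), which enhances lip synchronization and spatial consistency. Finally, we propose a dedicated evaluation suite with newly designed metrics for quantifying dyadic interaction. Comprehensive experiments show that InterDyad significantly outperforms state-of-the-art methods in generating natural and contextually grounded two-person interactions. For demo videos, please refer to the project page.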

