Fugu-MT Paper Translation (Abstract): ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation

Paper overview: ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation

arxiv url: http://arxiv.org/abs/2603.17427v1
Date: Wed, 18 Mar 2026 07:07:01 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.559529
Title: ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation
Title (reference translation): ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation
Authors: Xiangyu Kong, Xiaoyu Jin, Yihan Pan, Haoqin Sun, Hengde Zhu, Xiaoming Xu, Xiaoming Wei, Lu Liu, Siyang Song
Abstract summary: Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video that emulates the facial behaviors of natural face-to-face interaction. ECHO is a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module.
Reference score (independently computed attention score): 37.457960520410246
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., human user's behaviors and pre-defined audio for avatar) within a short temporal window, jointly driving generation of avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that facilitates contextual understanding of both behavior-grounded dynamics and linguistic-driven affective semantics to promote contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module, that preserves self-audio-driven lip articulation while adaptively integrating user contextual behavioral cues for non-lip facial regions, complemented by our designed two-stage training paradigm, to jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of proposed components and ECHO's superior IHG performance.
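Abstract (reference translation): In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., the human user's behaviors and pre-defined audio for the avatar) within a short temporal window, jointly driving generation of the avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components. The Long-range Contextual Understanding (LCU) component facilitates contextual understanding of both behavior-grounded dynamics and linguistic-driven affective semantics to promote the contextual appropriateness and emotional rationality of avatar FBs. Extensive experiments demonstrate the effectiveness of the proposed components and ECHO's superior IHG performance.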
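The abstract describes SDCM only at a high level: lip articulation stays driven by the avatar's own audio, while the user's behavioral cues are integrated into non-lip facial regions. The following is a minimal sketch of that idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the per-token `lip_mask`, the additive fusion rule, and the absence of learned projections or block-wise structure.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    Hypothetical simplification: `context` serves as both keys and values,
    and there are no learned projection matrices.
    """
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(len(q))]
        )
    return attended


def decoupled_modulation(face_tokens, audio_feats, user_feats, lip_mask):
    """Fuse two separate cross-attention branches with a spatial mask.

    Assumed fusion rule: out = x + attn(x, audio) + (1 - m) * attn(x, user),
    where m is 1.0 for lip-region tokens and 0.0 for non-lip tokens, so
    user-behavior features never reach the lip tokens.
    """
    audio_out = cross_attention(face_tokens, audio_feats)
    user_out = cross_attention(face_tokens, user_feats)
    fused = []
    for x, a, u, m in zip(face_tokens, audio_out, user_out, lip_mask):
        fused.append([xi + ai + (1.0 - m) * ui for xi, ai, ui in zip(x, a, u)])
    return fused


# Toy example: token 0 is a lip-region token, token 1 is a non-lip token.
face_tokens = [[1.0, 0.0], [0.0, 1.0]]
lip_mask = [1.0, 0.0]
audio_feats = [[0.5, 0.5], [1.0, -1.0]]

out_a = decoupled_modulation(face_tokens, audio_feats, [[2.0, 0.0]], lip_mask)
out_b = decoupled_modulation(face_tokens, audio_feats, [[0.0, 3.0]], lip_mask)
```

In this toy setup, changing the user features changes only the non-lip token, which is the decoupling property the abstract attributes to SDCM. A soft mask with values between 0 and 1 would blend the two branches instead of separating them completely.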
