Fugu-MT Paper Translation (Summary): SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding

Paper summary: SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding

arxiv url: http://arxiv.org/abs/2509.19965v1
Date: Wed, 24 Sep 2025 10:21:29 GMT
Status: Translation complete
System update date: 2025-09-25 20:53:19.773275
Title: SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding
Title (reference translation): SynchroRaMa : Lip-Synchronized, Emotion-Aware Face Generation via Multi-Modal Emotion Embedding
Authors: Phyo Thet Yee, Dimitrios Kollias, Sudeepta Mishra, Abhinav Dhall
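Abstract summary: SynchroRaMa is a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text and audio. SynchroRaMa includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio. Experiments on benchmark datasets show that SynchroRaMa outperforms the state-of-the-art.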
Reference score (independently computed attention score): 22.47072342385842
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Audio-driven talking face generation has received growing interest, particularly for applications requiring expressive and natural human-avatar interaction. However, most existing emotion-aware methods rely on a single modality (either audio or image) for emotion embedding, limiting their ability to capture nuanced affective cues. Additionally, most methods condition on a single reference image, restricting the model's ability to represent dynamic changes in actions or attributes across time. To address these issues, we introduce SynchroRaMa, a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text (via sentiment analysis) and audio (via speech-based emotion recognition and audio-derived valence-arousal features), enabling the generation of talking face videos with richer and more authentic emotional expressiveness and fidelity. To ensure natural head motion and accurate lip synchronization, SynchroRaMa includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio. Finally, SynchroRaMa incorporates scene descriptions generated by a Large Language Model (LLM) as additional textual input, enabling it to capture dynamic actions and high-level semantic attributes. Conditioning the model on both visual and textual cues enhances temporal consistency and visual realism. Quantitative and qualitative experiments on benchmark datasets demonstrate that SynchroRaMa outperforms the state-of-the-art, achieving improvements in image quality, expression preservation, and motion realism. A user study further confirms that SynchroRaMa achieves higher subjective ratings than competing methods in overall naturalness, motion diversity, and video smoothness. Our project page is available at <https://novicemm.github.io/synchrorama>.
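Abstract (reference translation): Audio-driven talking face generation is attracting growing interest, particularly for applications that require expressive and natural human-avatar interaction. However, most existing emotion-aware methods rely on a single modality (audio or image) for emotion embedding, which limits their ability to capture nuanced affective cues. In addition, most methods condition on a single reference image, which limits the model's ability to represent dynamic changes in actions or attributes over time. To address these issues, the authors introduce SynchroRaMa, a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text (via sentiment analysis) and audio (via speech-based emotion recognition and audio-derived valence-arousal features). To ensure natural head motion and accurate lip synchronization, SynchroRaMa includes an audio-to-motion (A2M) module. Finally, SynchroRaMa incorporates scene descriptions generated by a Large Language Model (LLM) as additional textual input, allowing it to capture dynamic actions and high-level semantic attributes. Conditioning the model on both visual and textual cues improves temporal consistency and visual realism. Quantitative and qualitative experiments on benchmark datasets show that SynchroRaMa outperforms the state-of-the-art, with improvements in image quality, expression preservation, and motion realism. A user study confirms that SynchroRaMa receives higher subjective ratings than competing methods in overall naturalness, motion diversity, and video smoothness. The project page is available at <https://novicemm.github.io/synchrorama>.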
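
Illustrative sketch (not from the paper): the abstract says the emotion embedding combines a text sentiment signal, speech-based emotion recognition, and audio-derived valence-arousal features, but it does not say how they are fused. The Python below shows one simple possibility, plain concatenation into a single vector. The function names, the feature layout, and the [-1, 1] value ranges are all assumptions made for illustration and should not be read as SynchroRaMa's actual implementation.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def fuse_emotion_features(text_sentiment, speech_emotion_logits, valence, arousal):
    """Concatenate per-modality emotion cues into one vector.

    Hypothetical layout (an assumption, not the paper's design):
    [text sentiment, speech-emotion class probabilities..., valence, arousal]
    """
    for name, value in (("text_sentiment", text_sentiment),
                        ("valence", valence),
                        ("arousal", arousal)):
        if not -1.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [-1, 1], got {value}")
    speech_probs = softmax(speech_emotion_logits)
    return [text_sentiment] + speech_probs + [valence, arousal]


# Example: mildly positive text, two equally likely speech-emotion classes.
embedding = fuse_emotion_features(0.5, [0.0, 0.0], valence=0.2, arousal=-0.1)
print(embedding)
```

In a real system a learned projection or attention layer would more likely follow the concatenation; the sketch stops at the raw fused vector to avoid guessing at details the abstract does not give.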


Related papers