Fugu-MT Paper Translation (Summary): SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization

Paper summary: SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization

arxiv url: http://arxiv.org/abs/2604.13335v1
Date: Tue, 14 Apr 2026 22:55:09 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.323183
Title: SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization
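Authors: Farzaneh Jafari, Stefano Berretti, Anup Basu
Abstract summary: SEDTalker is an emotion-aware framework for speech-driven 3D facial animation. It uses frame-level speech emotion diarization to achieve fine-grained expressive control.
Reference score (attention score computed independently by this site): 14.632533340477591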
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce SEDTalker, an emotion-aware framework for speech-driven 3D facial animation that leverages frame-level speech emotion diarization to achieve fine-grained expressive control. Unlike prior approaches that rely on utterance-level or manually specified emotion labels, our method predicts temporally dense emotion categories and intensities directly from speech, enabling continuous modulation of facial expressions over time. The diarized emotion signals are encoded as learned embeddings and used to condition a speech-driven 3D animation model based on a hybrid Transformer-Mamba architecture. This design allows effective disentanglement of linguistic content and emotional style while preserving identity and temporal coherence. We evaluate our approach on a large-scale multi-corpus dataset for speech emotion diarization and on the EmoVOCA dataset for emotional 3D facial animation. Quantitative results demonstrate strong frame-level emotion recognition performance and low geometric and temporal reconstruction errors, while qualitative results show smooth emotion transitions and consistent expression control. These findings highlight the effectiveness of frame-level emotion diarization for expressive and controllable 3D talking head generation.
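The conditioning idea in the abstract (per-frame emotion categories and intensities, encoded as embeddings, modulating a speech-driven model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the label set `EMOTIONS`, the size `EMBED_DIM`, the random table standing in for learned embeddings, and the additive, intensity-scaled conditioning in `condition_frames`. The abstract does not say how the embeddings are injected, and the hybrid Transformer-Mamba animation model is not represented at all.

```python
import random

# Hypothetical label set; the abstract does not list the emotion categories.
EMOTIONS = ["neutral", "happy", "sad", "angry"]
EMBED_DIM = 4  # illustrative size only


def make_embedding_table(num_labels, dim, seed=0):
    """Stand-in for a learned embedding table: one vector per emotion label."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_labels)]


def condition_frames(speech_feats, labels, intensities, table):
    """Add an intensity-scaled emotion embedding to each frame's speech feature.

    Additive conditioning is an assumption of this sketch, not the paper's
    confirmed mechanism.
    """
    if not (len(speech_feats) == len(labels) == len(intensities)):
        raise ValueError("per-frame inputs must have the same length")
    conditioned = []
    for feat, label, intensity in zip(speech_feats, labels, intensities):
        emb = table[EMOTIONS.index(label)]
        conditioned.append([f + intensity * e for f, e in zip(feat, emb)])
    return conditioned


# Three frames of dummy speech features with a neutral-to-happy transition.
table = make_embedding_table(len(EMOTIONS), EMBED_DIM)
speech = [[0.0] * EMBED_DIM for _ in range(3)]
labels = ["neutral", "happy", "happy"]
intensities = [0.0, 0.5, 1.0]
out = condition_frames(speech, labels, intensities, table)
```

Because the label and intensity vary per frame rather than per utterance, the conditioning signal can ramp up gradually, which is the kind of continuous modulation over time the abstract describes.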


Related papers