Fugu-MT Paper Translation (Abstract): Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation

Paper overview: Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation

arxiv url: http://arxiv.org/abs/2601.12876v1
Date: Mon, 19 Jan 2026 09:31:24 GMT
Status: Translation complete
Last updated in system: 2026-01-21 22:47:22.83423
Title: Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation
Authors: Zhenxuan Lu, Zhihua Xu, Zhijing Yang, Feng Gao, Yongyi Lu, Keze Wang, Tianshui Chen
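Abstract summary: Speech-Preserving Facial Expression Manipulation (SPFEM) is an innovative technique aimed at altering facial expressions in images and videos. Despite advancements, SPFEM still struggles with accurate lip synchronization due to the complex interplay between facial expressions and mouth shapes. This paper presents Talking Head Facial Expression Manipulation (THFEM), a new framework that uses AD-THG models to generate frames with accurately synchronized lip movements.
Reference score (independently computed attention level): 34.89590516635867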
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speech-Preserving Facial Expression Manipulation (SPFEM) is an innovative technique aimed at altering facial expressions in images and videos while retaining the original mouth movements. Despite advancements, SPFEM still struggles with accurate lip synchronization due to the complex interplay between facial expressions and mouth shapes. Capitalizing on the advanced capabilities of audio-driven talking head generation (AD-THG) models in synthesizing precise lip movements, our research introduces a novel integration of these models with SPFEM. We present a new framework, Talking Head Facial Expression Manipulation (THFEM), which utilizes AD-THG models to generate frames with accurately synchronized lip movements from audio inputs and SPFEM-altered images. However, increasing the number of frames generated by AD-THG models tends to compromise the realism and expression fidelity of the images. To counter this, we develop an adjacent frame learning strategy that finetunes AD-THG models to predict sequences of consecutive frames. This strategy enables the models to incorporate information from neighboring frames, significantly improving image quality during testing. Our extensive experimental evaluations demonstrate that this framework effectively preserves mouth shapes during expression manipulations, highlighting the substantial benefits of integrating AD-THG with SPFEM.
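The abstract says the adjacent frame learning strategy finetunes AD-THG models to predict sequences of consecutive frames rather than isolated ones. As a rough illustration only, the sketch below shows one way training samples could be grouped into windows of adjacent frames. The function name `adjacent_frame_samples`, the `window` parameter, and the one-to-one alignment between audio features and frames are all assumptions made for this example; the abstract does not specify the data layout, window size, or loss used by the authors.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Audio = TypeVar("Audio")


def adjacent_frame_samples(
    frames: Sequence[Frame],
    audio_feats: Sequence[Audio],
    window: int = 3,
) -> List[Tuple[List[Audio], List[Frame]]]:
    """Group aligned audio features and target frames into sliding windows.

    Hypothetical helper: each returned sample pairs `window` consecutive
    audio features with the `window` consecutive frames they correspond to,
    so a model trained on these samples predicts a short run of adjacent
    frames at once instead of a single frame.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    if len(frames) != len(audio_feats):
        raise ValueError("frames and audio_feats must be aligned one-to-one")

    samples = []
    # Slide over the sequence one frame at a time; neighbouring samples
    # overlap, which is what exposes the model to adjacent-frame context.
    for start in range(len(frames) - window + 1):
        stop = start + window
        samples.append((list(audio_feats[start:stop]), list(frames[start:stop])))
    return samples


# Toy data: strings stand in for image frames and audio feature vectors.
toy_frames = ["f0", "f1", "f2", "f3", "f4"]
toy_audio = ["a0", "a1", "a2", "a3", "a4"]
toy_samples = adjacent_frame_samples(toy_frames, toy_audio, window=3)
```

With five aligned frames and a window of three, this yields three overlapping samples. In an actual finetuning setup each sample would presumably feed a sequence-level reconstruction loss, but that part is not described in the abstract and is left out here.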

Related papers