Fugu-MT Paper Translation (Abstract): Exploring Phonetic Context in Lip Movement for Authentic Talking Face Generation

Paper overview: Exploring Phonetic Context in Lip Movement for Authentic Talking Face Generation

arxiv url: http://arxiv.org/abs/2305.19556v1
Date: Wed, 31 May 2023 04:50:32 GMT
Status: Translation complete
Last updated in system: 2023-06-01 18:27:48.897161
Title: Exploring Phonetic Context in Lip Movement for Authentic Talking Face Generation
Title (reference translation): Exploring Phonetic Context in Lip Movement for Face Generation
Authors: Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro
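Abstract summary: This paper proposes a Context-Aware Lip-Sync framework (CALS) for talking face generation. CALS maps each phone to a context-aware lip motion unit, which in turn guides the synthesis of a target identity with context-aware lip motion. Experiments on the LRW, LRS2, and HDTF datasets show that the proposed CALS effectively improves spatio-temporal alignment.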
Reference score (independently computed attention score): 29.775211740305906
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Talking face generation is the task of synthesizing a natural face synchronized with driving audio. Although much progress has been made in the visual quality, lip synchronization, and facial motion of the talking face, current works still struggle with crude and asynchronous lip movement, which can result in puppetry-like animation. We observe that prior works commonly correlate lip movement with audio at the phone level. However, due to co-articulation, where a phone is influenced by the preceding or following phones, the articulation of a phone varies with the phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temporally aligned and stable lip movement. In this respect, we investigate the phonetic context in lip motion for authentic talking face generation. We propose a Context-Aware Lip-Sync framework (CALS), which leverages phonetic context to generate more spatio-temporally aligned and stable lip movement. CALS comprises an Audio-to-Lip module and a Lip-to-Face module. The former explicitly maps each phone to a contextualized lip motion unit, which guides the latter in synthesizing a target identity with context-aware lip motion. In addition, we introduce a discriminative sync critic that enforces accurate lip displacements within the phonetic context through an audio-visual sync loss and a visual discriminative sync loss. Through extensive experiments on the LRW, LRS2, and HDTF datasets, we demonstrate that the proposed CALS effectively enhances spatio-temporal alignment, greatly improving upon the state of the art in visual quality, lip-sync quality, and realness. Finally, we show the authenticity of the generated video through a lip readability test, achieving a word prediction accuracy of 97.7% relative to real videos.
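Abstract (reference translation): Talking face generation is the task of synthesizing a natural face that is synchronized with driving audio. Much progress has been made in the visual quality, lip synchronization, and facial motion of the face, but current works still struggle to overcome the problem of crude, asynchronous lip movement that leads to puppet-like animation. In prior work, lip movement has commonly been correlated with audio at the phone level. However, because of co-articulation, in which an isolated phone is influenced by the preceding or following phones, the articulation of a phone differs depending on the phonetic context. Modeling lip motion with the phonetic context can therefore generate lip movement that is more spatio-temporally aligned and stable. This work accordingly investigates the phonetic context of lip motion for talking face generation. It proposes a Context-Aware Lip-Sync framework (CALS) that uses phonetic context to generate spatio-temporally aligned, stable lip movement. CALS consists of an Audio-to-Lip module and a Lip-to-Face module. The former explicitly maps each phone to a contextualized lip motion unit, and the latter synthesizes the target identity with context-aware lip motion. The work also proposes a discriminative sync critic that enforces accurate lip displacements within the phonetic context through an audio-visual sync loss and a visual discriminative sync loss. Extensive experiments on the LRW, LRS2, and HDTF datasets show that the proposed CALS effectively improves spatio-temporal alignment and substantially improves on the state of the art in visual quality, lip-sync quality, and realness. Finally, a lip readability test demonstrates the authenticity of the generated video, achieving a word prediction accuracy of 97.7% relative to real videos.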
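
Illustrative sketch (not part of the paper): the abstract contrasts phone-level modeling with modeling that takes the surrounding phones into account. The toy Python snippet below shows only that contrast, by pairing each phone with its neighbours so that the same phone yields a different input in different contexts. It is not the CALS Audio-to-Lip module, whose internals the abstract does not describe. The function name `context_windows`, the `radius` parameter, the `<sil>` padding symbol, and the ARPAbet-style phone labels are all assumptions made for illustration.

```python
from typing import List, Tuple


def context_windows(
    phones: List[str], radius: int = 1, pad: str = "<sil>"
) -> List[Tuple[str, ...]]:
    """Pair each phone with `radius` neighbouring phones on either side.

    Hypothetical helper: a phone-level model would see only phones[i];
    a context-aware model would see the whole window around phones[i].
    """
    # Pad both ends so the first and last phones also get full windows.
    padded = [pad] * radius + list(phones) + [pad] * radius
    width = 2 * radius + 1
    return [tuple(padded[i:i + width]) for i in range(len(phones))]


# Hypothetical phone sequence in which "AH" occurs twice.
phones = ["B", "AH", "T", "AH", "M"]
windows = context_windows(phones)

# The two "AH" tokens are identical at the phone level but sit in
# different windows, which is the situation co-articulation describes.
for phone, window in zip(phones, windows):
    print(phone, window)
```

In this sketch the two occurrences of "AH" receive the windows `("B", "AH", "T")` and `("T", "AH", "M")`, so a model conditioned on the window could in principle assign them different lip shapes, whereas a model conditioned on the phone alone could not.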

Related papers