Fugu-MT Paper Translation (Abstract): Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning

Paper abstract: Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning

arxiv url: http://arxiv.org/abs/2108.00262v2
Date: Tue, 3 Aug 2021 10:35:44 GMT
Status: translation complete
Last updated in system: 2021-08-05 02:21:31.966398
Title: Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning
Title (reference translation): Speech2AffectiveGestures: Synthesizing Co-Speech Gestures through Generative Adversarial Affective Expression Learning
Authors: Uttaran Bhattacharya and Elizabeth Childs and Nicholas Rewkowski and Dinesh Manocha
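Abstract summary: We propose a generative adversarial network that synthesizes 3D pose sequences with appropriate affective expressions. The network consists of a generator, which synthesizes gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator, which distinguishes synthesized pose sequences from real 3D pose sequences.
Reference score (independently calculated attention level): 63.06044724907101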
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions. Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish between the synthesized pose sequences and real 3D pose sequences. We leverage the Mel-frequency cepstral coefficients and the text transcript computed from the input speech in separate encoders in our generator to learn the desired sentiments and the associated affective cues. We design an affective encoder using multi-scale spatial-temporal graph convolutions to transform 3D pose sequences into latent, pose-based affective features. We use our affective encoder in both our generator, where it learns affective features from the seed poses to guide the gesture synthesis, and our discriminator, where it enforces the synthesized gestures to contain the appropriate affective expressions. We perform extensive evaluations on two benchmark datasets for gesture synthesis from the speech, the TED Gesture Dataset and the GENEA Challenge 2020 Dataset. Compared to the best baselines, we improve the mean absolute joint error by 10–33%, the mean acceleration difference by 8–58%, and the Fréchet Gesture Distance by 21–34%. We also conduct a user study and observe that compared to the best current baselines, around 15.28% of participants indicated our synthesized gestures appear more plausible, and around 16.32% of participants felt the gestures had more appropriate affective expressions aligned with the speech.
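Abstract (reference translation): We propose a generative adversarial network that synthesizes 3D poses with appropriate affective expressions. The network consists of a generator, which synthesizes gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator, which distinguishes synthesized pose sequences from real 3D pose sequences. We use the Mel-frequency cepstral coefficients and the text transcript computed from the input speech to learn the desired sentiments and the associated affective cues. We design an affective encoder using multi-scale spatial-temporal graph convolutions that transforms 3D pose sequences into latent, pose-based affective features. We use the affective encoder in both the generator, where it learns affective features from the seed poses to guide gesture synthesis, and the discriminator, where it pushes the synthesized gestures to contain appropriate affective expressions. We evaluate extensively on two benchmark datasets for gesture synthesis from speech, the TED Gesture Dataset and the GENEA Challenge 2020 Dataset. Compared to the best baselines, we improve the mean absolute joint error by 10–33%, the mean acceleration difference by 8–58%, and the Fréchet Gesture Distance by 21–34%. In a user study, around 15.28% of participants indicated that our synthesized gestures appear more plausible, and around 16.32% of participants felt the gestures had more appropriate affective expressions aligned with the speech.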
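
Illustrative sketch: the abstract reports improvements in mean absolute joint error and mean acceleration difference. The Python below shows one plausible way to compute two such quantities for a pair of pose sequences. It is not the authors' evaluation code. The function names, the nested-list layout (frames, then joints, then xyz coordinates), and the use of a unit-time second-order finite difference for acceleration are all assumptions made for illustration.

```python
def mean_absolute_joint_error(pred, real):
    """Mean absolute difference over all frames, joints, and coordinates.

    Assumed layout (hypothetical): pred[t][j][c] with t = frame,
    j = joint, c = x/y/z coordinate. Both sequences share one shape.
    """
    total, count = 0.0, 0
    for frame_p, frame_r in zip(pred, real):
        for joint_p, joint_r in zip(frame_p, frame_r):
            for p, r in zip(joint_p, joint_r):
                total += abs(p - r)
                count += 1
    if count == 0:
        raise ValueError("sequences must contain at least one coordinate")
    return total / count


def accelerations(seq):
    """Second-order finite differences x[t+1] - 2*x[t] + x[t-1].

    Assumes a unit time step; returns len(seq) - 2 frames.
    """
    return [
        [
            [nxt[j][c] - 2.0 * cur[j][c] + prev[j][c]
             for c in range(len(cur[j]))]
            for j in range(len(cur))
        ]
        for prev, cur, nxt in zip(seq, seq[1:], seq[2:])
    ]


def mean_acceleration_difference(pred, real):
    """Mean absolute difference between the two acceleration sequences.

    Needs at least three frames, otherwise no acceleration is defined.
    """
    return mean_absolute_joint_error(accelerations(pred), accelerations(real))


# Toy data: one joint, four frames. The real pose stays at the origin,
# the predicted joint moves along x as t squared.
real = [[[0.0, 0.0, 0.0]] for _ in range(4)]
pred = [[[float(t * t), 0.0, 0.0]] for t in range(4)]

print(mean_absolute_joint_error(pred, real))
print(mean_acceleration_difference(pred, real))
```

Lower values indicate that the synthesized motion stays closer to the reference in position and in acceleration, respectively. The Fréchet Gesture Distance is left out of this sketch because it depends on a trained feature extractor.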
