Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022
- URL: http://arxiv.org/abs/2303.08737v2
- Date: Thu, 28 Mar 2024 16:59:24 GMT
- Title: Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022
- Authors: Taras Kucherenko, Pieter Wolfert, Youngwoo Yoon, Carla Viegas, Teodor Nikolov, Mihail Tsakov, Gustav Eje Henter
- Abstract summary: This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation.
Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation.
We evaluate both the human-likeness of the gesture motion and its appropriateness for the specific speech signal.
- Score: 8.822263327342071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field. The evaluation results show some synthetic gesture conditions being rated as significantly more human-like than 3D human motion capture. To the best of our knowledge, this has not been demonstrated before. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around -0.5. Based on the challenge results we formulate numerous recommendations for system building and evaluation.
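To make the FGD figure above concrete, here is a minimal sketch of how a Fréchet-style distance between gesture-feature distributions is typically computed, followed by the Kendall's tau check against human ratings. The feature extractor is assumed (FGD is usually computed on embeddings from a pretrained motion autoencoder), and all numbers below are illustrative placeholders, not challenge data.

```python
# Minimal sketch (not the challenge's exact code): Fréchet gesture distance
# between Gaussians fitted to feature embeddings of real and generated
# motion, plus Kendall's tau rank correlation against human ratings.
import numpy as np
from scipy.linalg import sqrtm
from scipy.stats import kendalltau

def frechet_gesture_distance(feats_real, feats_gen):
    """feats_*: (num_clips, feature_dim) embeddings, e.g. from a
    pretrained motion autoencoder (the usual choice for FGD)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)      # matrix square root of the product
    if np.iscomplexobj(covmean):        # drop tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy demonstration on random "features".
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 32))
gen = rng.normal(loc=0.3, size=(200, 32))
print(f"FGD on toy features: {frechet_gesture_distance(real, gen):.3f}")

# Rank correlation between per-condition FGD and human-likeness ratings
# (placeholder numbers). A tau around -0.5, as the abstract reports, means
# lower FGD only loosely tracks higher perceived human-likeness.
fgd_scores = [12.1, 25.3, 8.7, 40.2]   # lower is better
ratings = [61.0, 45.0, 68.0, 30.0]     # higher is better
tau, _ = kendalltau(fgd_scores, ratings)
print(f"Kendall's tau: {tau:.2f}")
```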
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings [8.527975206444742]
This paper reports on the GENEA Challenge 2023, in which participating teams built speech-driven gesture-generation systems.
We evaluated 12 submissions and 2 baselines together with held-out motion-capture data in several large-scale user studies.
We found a large span in human-likeness between challenge submissions, with a few systems rated close to human mocap.
arXiv Detail & Related papers (2023-08-24T08:42:06Z) - Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596]
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model to address this uncertainty in gesture synthesis by modeling gesture segments as discrete latent codes, as sketched below.
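As an illustration of the discrete-latent-code idea this summary describes, below is a toy vector-quantization module in the VQ-VAE style. The class name, codebook size, and tensor shapes are hypothetical; the paper's actual two-stage model is more elaborate.

```python
# Hypothetical sketch: quantize continuous gesture-segment features into
# discrete codebook tokens (VQ-VAE style). Not the paper's actual model.
import torch

class GestureQuantizer(torch.nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_codes, code_dim)

    def forward(self, z):
        # z: (batch, time, code_dim) continuous gesture-segment features.
        flat = z.reshape(-1, z.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)      # (B*T, num_codes)
        tokens = dists.argmin(dim=-1).reshape(z.shape[:-1])  # (B, T) token ids
        quantized = self.codebook(tokens)
        # Straight-through estimator: gradients bypass the argmin lookup.
        quantized = z + (quantized - z).detach()
        return quantized, tokens
```

A second stage (for example a speech-conditioned prior over token sequences) can then sample different token strings for the same utterance, one way to cope with the one-to-many mapping mentioned above.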
arXiv Detail & Related papers (2023-03-04T01:42:09Z) - Audio2Gestures: Generating Diverse Gestures from Audio [28.026220492342382]
We propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code.
Our method generates more realistic and diverse motions than previous state-of-the-art methods.
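A toy sketch of that shared/motion-specific split follows; the layers, dimensions, and sampling scheme are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical sketch: predict a shared code from audio and sample a
# motion-specific code per call, so one utterance can yield many motions.
import torch

class SplitLatentDecoder(torch.nn.Module):
    def __init__(self, audio_dim=128, shared_dim=32, specific_dim=16, pose_dim=45):
        super().__init__()
        self.audio_to_shared = torch.nn.Linear(audio_dim, shared_dim)
        self.decoder = torch.nn.Linear(shared_dim + specific_dim, pose_dim)
        self.specific_dim = specific_dim

    def forward(self, audio_feats):
        # audio_feats: (batch, time, audio_dim)
        shared = self.audio_to_shared(audio_feats)
        specific = torch.randn(*shared.shape[:-1], self.specific_dim)
        return self.decoder(torch.cat([shared, specific], dim=-1))

model = SplitLatentDecoder()
audio = torch.randn(1, 10, 128)
# Two calls on identical audio give two different, equally valid motions.
assert not torch.allclose(model(audio), model(audio))
```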
arXiv Detail & Related papers (2023-01-17T04:09:58Z) - Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z) - The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation [9.661373458482291]
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation.
Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation.
Some synthetic conditions are rated as significantly more human-like than human motion capture.
arXiv Detail & Related papers (2022-08-22T16:55:02Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning [63.06044724907101]
We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions.
Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish between the synthesized pose sequences and real 3D pose sequences.
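The generator/discriminator pairing described above can be sketched as follows; every architecture choice, dimension, and loss detail here is a placeholder rather than the paper's design.

```python
# Hypothetical sketch: a recurrent generator maps speech features plus a
# seed pose to a pose sequence; an MLP discriminator scores sequences as
# real or synthesized. Placeholder architectures and dimensions throughout.
import torch

pose_dim, speech_dim, seq_len, batch = 45, 128, 64, 8

generator = torch.nn.GRU(speech_dim + pose_dim, pose_dim, batch_first=True)
discriminator = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(seq_len * pose_dim, 256),
    torch.nn.LeakyReLU(0.2),
    torch.nn.Linear(256, 1),  # real-vs-fake logit
)

speech = torch.randn(batch, seq_len, speech_dim)
seed = torch.randn(batch, 1, pose_dim).expand(-1, seq_len, -1)  # repeated seed pose
fake_poses, _ = generator(torch.cat([speech, seed], dim=-1))
real_poses = torch.randn(batch, seq_len, pose_dim)  # stand-in for mocap clips

bce = torch.nn.functional.binary_cross_entropy_with_logits
d_loss = (bce(discriminator(real_poses), torch.ones(batch, 1))
          + bce(discriminator(fake_poses.detach()), torch.zeros(batch, 1)))
g_loss = bce(discriminator(fake_poses), torch.ones(batch, 1))  # fool the critic
```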
arXiv Detail & Related papers (2021-07-31T15:13:39Z) - Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z) - Socially and Contextually Aware Human Motion and Pose Forecasting [48.083060946226]
We propose a novel framework to tackle both human motion and body skeleton pose forecasting.
We incorporate both scene and social contexts as critical cues for this prediction task.
Our proposed framework achieves superior performance compared to several baselines on two social datasets.
arXiv Detail & Related papers (2020-07-14T06:12:13Z)