The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation
- URL: http://arxiv.org/abs/2208.10441v1
- Date: Mon, 22 Aug 2022 16:55:02 GMT
- Title: The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation
- Authors: Youngwoo Yoon, Pieter Wolfert, Taras Kucherenko, Carla Viegas, Teodor Nikolov, Mihail Tsakov, Gustav Eje Henter
- Abstract summary: This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation.
Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation.
Some synthetic conditions are rated as significantly more human-like than human motion capture.
- Score: 9.661373458482291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. This year's dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which previously was a major challenge in the field.
The evaluation results are a revolution, and a revelation. Some synthetic conditions are rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. Additional material is available via the project website at https://youngwoo-yoon.github.io/GENEAchallenge2022/
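
As a rough illustration of how ratings from crowdsourced studies like these can be aggregated and compared, here is a minimal, self-contained sketch. It assumes 0-100 human-likeness ratings summarised by medians with bootstrap confidence intervals; the data, names, and numbers are hypothetical, and this is not the paper's actual analysis pipeline:

```python
import random
import statistics

def bootstrap_median_ci(ratings, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median rating."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(ratings, k=len(ratings)))
        for _ in range(n_resamples)
    )
    lo = medians[int(n_resamples * alpha / 2)]
    hi = medians[int(n_resamples * (1 - alpha / 2)) - 1]
    return statistics.median(ratings), (lo, hi)

# Hypothetical 0-100 human-likeness ratings for two conditions.
rng = random.Random(42)
ratings = {
    "human_mocap": [min(100, max(0, rng.gauss(68, 15))) for _ in range(300)],
    "synthetic_A": [min(100, max(0, rng.gauss(72, 15))) for _ in range(300)],
}

for condition, vals in ratings.items():
    median, (lo, hi) = bootstrap_median_ci(vals)
    print(f"{condition}: median {median:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")
```

Non-overlapping confidence intervals are a conservative indication that two conditions differ; consult the paper itself for the exact statistical procedures used in the challenge.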
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625] (arXiv, 2024-06-26)
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481] (arXiv, 2024-04-02)
Co-speech gestures, presented in video form, can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
- The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings [8.527975206444742] (arXiv, 2023-08-24)
This paper reports on the GENEA Challenge 2023, in which participating teams built speech-driven gesture-generation systems.
We evaluated 12 submissions and 2 baselines together with held-out motion-capture data in several large-scale user studies.
We found a large span in human-likeness between challenge submissions, with a few systems rated close to human mocap.
- Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022 [8.822263327342071] (arXiv, 2023-03-15)
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation.
Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation.
We evaluate both the human-likeness of the gesture motion and its appropriateness for the specific speech signal.
- Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596] (arXiv, 2023-03-04)
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model to address this uncertainty issue in gesture synthesis by modeling the gesture segments as discrete latent codes (a toy sketch of the token idea follows below).
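
The one-line summary above does not specify the model, but the discrete-token idea itself can be made concrete. The sketch below quantizes fixed-length motion segments against a codebook (random here; learned jointly with an encoder/decoder in any real system), with segment and pose dimensions assumed purely for illustration:

```python
import numpy as np

SEG_LEN, POSE_DIM, CODEBOOK_SIZE = 8, 15, 256

rng = np.random.default_rng(0)
# Stand-in codebook; a real model would learn these entries.
codebook = rng.normal(size=(CODEBOOK_SIZE, SEG_LEN * POSE_DIM))

def tokenize(motion):
    """Quantize a (frames, POSE_DIM) motion clip into discrete segment tokens."""
    n_segments = motion.shape[0] // SEG_LEN
    segments = motion[: n_segments * SEG_LEN].reshape(n_segments, -1)
    # Pick the nearest codebook entry (Euclidean distance) per segment.
    dists = np.linalg.norm(segments[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def detokenize(tokens):
    """Map a token sequence back to a (frames, POSE_DIM) motion clip."""
    return codebook[tokens].reshape(-1, POSE_DIM)

motion = rng.normal(size=(64, POSE_DIM))        # 64 frames of stand-in poses
tokens = tokenize(motion)
print(tokens.shape, detokenize(tokens).shape)   # (8,) (64, 15)
```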
- Generating Holistic 3D Human Motion from Speech [97.11392166257791] (arXiv, 2022-12-08)
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746] (arXiv, 2022-12-05)
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
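
ANGIE's learned decomposition is not reproduced here, but the general idea of splitting motion into a coarse recurring component and subtle residual dynamics can be illustrated with a deliberately simple moving-average split. This is a toy stand-in for the concept, not the paper's mechanism:

```python
import numpy as np

def decompose(motion, window=9):
    """Split a (frames, dims) motion clip into a smooth part and a residual."""
    kernel = np.ones(window) / window
    # Moving average along the time axis = coarse, slowly varying component.
    pattern = np.apply_along_axis(
        lambda channel: np.convolve(channel, kernel, mode="same"), 0, motion
    )
    residual = motion - pattern   # fine, high-frequency component
    return pattern, residual

rng = np.random.default_rng(0)
motion = np.cumsum(rng.normal(size=(120, 15)), axis=0)  # stand-in pose track
pattern, residual = decompose(motion)
# The coarse part carries most of the energy; the residual stays subtle.
print(round(pattern.std(), 2), round(residual.std(), 2))
```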
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496] (arXiv, 2022-03-24)
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
- Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning [63.06044724907101] (arXiv, 2021-07-31)
We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions.
Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish between the synthesized pose sequences and real 3D pose sequences.
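
That generator/discriminator setup can be sketched compactly. The dimensions, layer choices, and names below are assumptions made for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

SPEECH_DIM, POSE_DIM, SEQ_LEN, HIDDEN = 32, 27, 34, 128

class Generator(nn.Module):
    """Maps speech features plus a seed pose to a pose sequence."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(SPEECH_DIM + POSE_DIM, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, POSE_DIM)

    def forward(self, speech, seed_pose):
        # Condition every frame on the seed pose (joint embedding, simplified).
        seed = seed_pose.unsqueeze(1).expand(-1, speech.size(1), -1)
        h, _ = self.rnn(torch.cat([speech, seed], dim=-1))
        return self.out(h)                  # (batch, frames, POSE_DIM)

class Discriminator(nn.Module):
    """Scores a pose sequence as real motion capture or synthesized."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(POSE_DIM, HIDDEN, batch_first=True)
        self.score = nn.Linear(HIDDEN, 1)

    def forward(self, poses):
        _, h = self.rnn(poses)
        return self.score(h[-1])            # one real/fake logit per sequence

gen, disc = Generator(), Discriminator()
speech = torch.randn(4, SEQ_LEN, SPEECH_DIM)   # stand-in speech features
seed = torch.randn(4, POSE_DIM)                # stand-in seed poses
fake_poses = gen(speech, seed)
print(disc(fake_poses).shape)                  # torch.Size([4, 1])
```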
- Socially and Contextually Aware Human Motion and Pose Forecasting [48.083060946226] (arXiv, 2020-07-14)
We propose a novel framework to tackle both tasks of human motion (trajectory) forecasting and body skeleton pose forecasting.
We consider incorporating both scene and social contexts, as critical clues for this prediction task.
Our proposed framework achieves a superior performance compared to several baselines on two social datasets.
This list is automatically generated from the titles and abstracts of the papers on this site.