The GENEA Challenge 2023: A large-scale evaluation of gesture generation
models in monadic and dyadic settings
- URL: http://arxiv.org/abs/2308.12646v1
- Date: Thu, 24 Aug 2023 08:42:06 GMT
- Title: The GENEA Challenge 2023: A large-scale evaluation of gesture generation
models in monadic and dyadic settings
- Authors: Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, Jieyeon Woo, Teodor
Nikolov, Mihail Tsakov, Gustav Eje Henter
- Abstract summary: This paper reports on the GENEA Challenge 2023, in which participating teams built speech-driven gesture-generation systems.
We evaluated 12 submissions and 2 baselines together with held-out motion-capture data in several large-scale user studies.
We found a large span in human-likeness between challenge submissions, with a few systems rated close to human mocap.
- Score: 8.527975206444742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper reports on the GENEA Challenge 2023, in which participating teams
built speech-driven gesture-generation systems using the same speech and motion
dataset, followed by a joint evaluation. This year's challenge provided data on
both sides of a dyadic interaction, allowing teams to generate full-body motion
for an agent given its speech (text and audio) and the speech and motion of the
interlocutor. We evaluated 12 submissions and 2 baselines together with
held-out motion-capture data in several large-scale user studies. The studies
focused on three aspects: 1) the human-likeness of the motion, 2) the
appropriateness of the motion for the agent's own speech whilst controlling for
the human-likeness of the motion, and 3) the appropriateness of the motion for
the behaviour of the interlocutor in the interaction, using a setup that
controls for both the human-likeness of the motion and the agent's own speech.
We found a large span in human-likeness between challenge submissions, with a
few systems rated close to human mocap. Appropriateness seems far from being
solved, with most submissions performing in a narrow range slightly above
chance, far behind natural motion. The effect of the interlocutor is even more
subtle, with submitted systems at best performing barely above chance.
Interestingly, a dyadic system being highly appropriate for agent speech does
not necessarily imply high appropriateness for the interlocutor. Additional
material is available via the project website at
https://svito-zar.github.io/GENEAchallenge2023/ .
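The appropriateness studies described above compare matched stimuli (motion paired with its own speech or interlocutor) against mismatched ones, and systems score "slightly above chance". A minimal sketch of how such paired responses might be aggregated into a mean appropriateness score follows; the exact scoring procedure in the challenge paper may differ, and the `responses` data and both helper functions are illustrative assumptions:

```python
def appropriateness_score(responses):
    """Mean appropriateness score from paired-comparison responses.

    Each response is +1 if the rater preferred the matched stimulus,
    -1 if they preferred the mismatched one, and 0 for a tie.
    Chance level is 0; a system whose motion genuinely fits the speech
    should score clearly above it.
    """
    return sum(responses) / len(responses)

def preferred_matched_rate(responses):
    """Fraction of non-tie responses that favoured the matched stimulus."""
    decisive = [r for r in responses if r != 0]
    return sum(1 for r in decisive if r > 0) / len(decisive)

# Hypothetical responses for one system: many ties, slight matched preference.
responses = [1] * 30 + [-1] * 24 + [0] * 46
print(appropriateness_score(responses))   # 0.06, i.e. slightly above chance
print(preferred_matched_rate(responses))  # ~0.556
```

Under this sketch, "barely above chance" corresponds to a score near 0 (or a matched-preference rate near 0.5), which is the regime the abstract reports for most submissions in the interlocutor-appropriateness study.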
Related papers
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions [66.87211993793807]
We present ReMoS, a denoising diffusion-based model that synthesizes full-body reactive motion of a person in a two-person interaction scenario.
We demonstrate ReMoS across challenging two-person scenarios such as pair-dancing, Ninjutsu, kickboxing, and acrobatics.
We also contribute the ReMoCap dataset for two-person interactions containing full-body and finger motions.
arXiv Detail & Related papers (2023-11-28T18:59:52Z)
- InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions [49.097973114627344]
We present InterGen, an effective diffusion-based approach that incorporates human-to-human interactions into the motion diffusion process.
We first contribute a multimodal dataset, named InterHuman. It consists of about 107M frames for diverse two-person interactions, with accurate skeletal motions and 23,337 natural language descriptions.
We propose a novel representation for motion input in our interaction diffusion model, which explicitly formulates the global relations between the two performers in the world frame.
arXiv Detail & Related papers (2023-04-12T08:12:29Z)
- Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations [61.659439423703155]
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations.
Our method generates continuous motions that are parameterized only by the temporal coordinate.
This work takes a step further toward general human-scene interaction simulation.
arXiv Detail & Related papers (2023-03-23T09:31:56Z)
- Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022 [8.822263327342071]
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation.
Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation.
We evaluate both the human-likeness of the gesture motion and its appropriateness for the specific speech signal.
arXiv Detail & Related papers (2023-03-15T16:21:50Z)
- The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation [9.661373458482291]
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation.
Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation.
Some synthetic conditions are rated as significantly more human-like than human motion capture.
arXiv Detail & Related papers (2022-08-22T16:55:02Z)
- GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze.
Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects.
To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
arXiv Detail & Related papers (2022-04-20T13:17:39Z)
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) by equipping a two-stream architecture.
Experiments demonstrate the superior performance against several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)
- Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings [11.741529272872219]
To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors.
Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing non-verbal behavior.
We introduce a probabilistic method to synthesize interlocutor-aware facial gestures in dyadic conversations.
arXiv Detail & Related papers (2020-06-11T14:11:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.