Gesticulator: A framework for semantically-aware speech-driven gesture
generation
- URL: http://arxiv.org/abs/2001.09326v5
- Date: Thu, 14 Jan 2021 16:29:20 GMT
- Title: Gesticulator: A framework for semantically-aware speech-driven gesture
generation
- Authors: Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter,
Simon Alexanderson, Iolanda Leite, Hedvig Kjellström
- Abstract summary: We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
- Score: 17.284154896176553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: During speech, people spontaneously gesticulate, which plays a key role in
conveying information. Similarly, realistic co-speech gestures are crucial to
enable natural and smooth interactions with social agents. Current end-to-end
co-speech gesture generation systems use a single modality for representing
speech: either audio or text. These systems are therefore confined to producing
either acoustically-linked beat gestures or semantically-linked gesticulation
(e.g., raising a hand when saying "high"): they cannot appropriately learn to
generate both gesture types. We present a model designed to produce arbitrary
beat and semantic gestures together. Our deep-learning based model takes both
acoustic and semantic representations of speech as input, and generates
gestures as a sequence of joint angle rotations as output. The resulting
gestures can be applied to both virtual agents and humanoid robots. Subjective
and objective evaluations confirm the success of our approach. The code and
video are available at the project page
https://svito-zar.github.io/gesticulator .
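As a rough illustration of the input/output interface described in the abstract, the sketch below shows a toy PyTorch model that consumes aligned acoustic and semantic (text) feature sequences and emits per-frame joint angle rotations. The feature dimensions, the GRU-based fusion, and the layer choices are illustrative assumptions only, not the authors' architecture; their actual implementation is available from the project page above.
```python
# Minimal sketch of a speech-to-gesture model that fuses acoustic and
# semantic inputs. All dimensions and layer choices are assumptions for
# illustration; see the project page for the authors' released code.
import torch
import torch.nn as nn


class SpeechToGestureModel(nn.Module):
    def __init__(self, audio_dim=26, text_dim=768, hidden_dim=256, n_joint_angles=45):
        super().__init__()
        # Separate encoders for the two speech modalities.
        self.audio_encoder = nn.Linear(audio_dim, hidden_dim)
        self.text_encoder = nn.Linear(text_dim, hidden_dim)
        # Temporal model over the fused per-frame features.
        self.fusion = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        # Output head: joint angle rotations for each frame.
        self.decoder = nn.Linear(hidden_dim, n_joint_angles)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. spectral features
        # text_feats:  (batch, frames, text_dim), e.g. word embeddings
        #              aligned to the audio frames
        a = torch.relu(self.audio_encoder(audio_feats))
        t = torch.relu(self.text_encoder(text_feats))
        fused, _ = self.fusion(torch.cat([a, t], dim=-1))
        return self.decoder(fused)  # (batch, frames, n_joint_angles)


# Example: 4 seconds of speech at 20 fps -> 80 frames of joint rotations.
model = SpeechToGestureModel()
audio = torch.randn(1, 80, 26)   # acoustic representation
text = torch.randn(1, 80, 768)   # semantic (text) representation
gestures = model(audio, text)
print(gestures.shape)  # torch.Size([1, 80, 45])
```
Because the output is a sequence of joint rotations rather than pixels, the same predictions can in principle drive either a virtual agent or a humanoid robot, as the abstract notes.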
Related papers
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness [45.90256126021112]
We introduce FreeTalker, the first framework for generating both spontaneous (e.g., co-speech gestures) and non-spontaneous (e.g., moving around the podium) speaker motions.
Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions.
arXiv Detail & Related papers (2024-01-07T13:01:29Z)
- LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation [41.42316077949012]
We introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation.
Our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement.
Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style.
arXiv Detail & Related papers (2023-09-17T15:06:11Z)
- QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation [8.604430209445695]
Speech-driven gesture generation is highly challenging due to the random jitters of human motion.
We introduce a novel quantization-based and phase-guided motion-matching framework.
Our method outperforms recent approaches on speech-driven gesture generation.
arXiv Detail & Related papers (2023-05-18T16:31:25Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) built on a two-stream architecture.
Experiments demonstrate superior performance over several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)
- Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model outperforms existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.