A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
- URL: http://arxiv.org/abs/2301.05339v4
- Date: Mon, 10 Apr 2023 09:11:59 GMT
- Title: A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
- Authors: Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje
Henter, Michael Neff
- Abstract summary: The automatic generation of such co-speech gestures is a long-standing problem in computer animation.
Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion.
This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models.
- Score: 11.948557523215316
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Gestures that accompany speech are an essential part of natural and efficient
embodied human communication. The automatic generation of such co-speech
gestures is a long-standing problem in computer animation and is considered an
enabling technology in film, games, virtual social spaces, and for interaction
with social robots. The problem is made challenging by the idiosyncratic and
non-periodic nature of human co-speech gesture motion, and by the great
diversity of communicative functions that gestures encompass. Gesture
generation has seen surging interest recently, owing to the emergence of more
and larger datasets of human gesture motion, combined with strides in
deep-learning-based generative models, that benefit from the growing
availability of data. This review article summarizes co-speech gesture
generation research, with a particular focus on deep generative models. First,
we articulate the theory describing human gesticulation and how it complements
speech. Next, we briefly discuss rule-based and classical statistical gesture
synthesis, before delving into deep learning approaches. We employ the choice
of input modalities as an organizing principle, examining systems that generate
gestures from audio, text, and non-linguistic input. We also chronicle the
evolution of the related training data sets in terms of size, diversity, motion
quality, and collection method. Finally, we identify key research challenges in
gesture generation, including data availability and quality; producing
human-like motion; grounding the gesture in the co-occurring speech in
interaction with other speakers, and in the environment; performing gesture
evaluation; and integration of gesture synthesis into applications. We
highlight recent approaches to tackling the various key challenges, as well as
the limitations of these approaches, and point toward areas of future
development.
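The review organizes deep-learning systems by their input modalities (audio, text, and non-linguistic input such as speaker identity). As a purely illustrative sketch of that organizing principle, not any specific system from the review, the following PyTorch module shows the typical shape of such models: per-modality encoders whose features are fused and decoded into a sequence of joint rotations. All names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SpeechToGestureSketch(nn.Module):
    """Illustrative multimodal gesture generator: audio + text + speaker id -> pose sequence.

    Hypothetical sketch of the common pattern surveyed in the review
    (modality encoders -> fusion -> sequence decoder), not a specific model.
    """

    def __init__(self, audio_dim=64, text_dim=300, n_speakers=16,
                 hidden=256, n_joints=15, rot_dim=6):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.text_enc = nn.GRU(text_dim, hidden, batch_first=True)
        self.speaker_emb = nn.Embedding(n_speakers, hidden)
        self.decoder = nn.GRU(3 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_joints * rot_dim)  # e.g. a 6D rotation per joint

    def forward(self, audio_feats, text_feats, speaker_id):
        # audio_feats: (B, T, audio_dim)  frame-level acoustic features (e.g. mel spectrogram)
        # text_feats:  (B, T, text_dim)   word embeddings aligned to the audio frames
        # speaker_id:  (B,)               integer style/identity label (non-linguistic input)
        a, _ = self.audio_enc(audio_feats)
        t, _ = self.text_enc(text_feats)
        s = self.speaker_emb(speaker_id).unsqueeze(1).expand(-1, a.size(1), -1)
        fused = torch.cat([a, t, s], dim=-1)
        h, _ = self.decoder(fused)
        return self.out(h)  # (B, T, n_joints * rot_dim) pose sequence


if __name__ == "__main__":
    model = SpeechToGestureSketch()
    poses = model(torch.randn(2, 100, 64), torch.randn(2, 100, 300),
                  torch.tensor([0, 3]))
    print(poses.shape)  # torch.Size([2, 100, 90])
```

Systems covered in the review replace these placeholder recurrent encoders with, for example, pre-trained speech models, transformers, or diffusion decoders, but the decomposition by input modality above is the lens the article uses.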
Related papers
- Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation [70.52558242336988]
We focus on predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation.
We introduce a novel fusion strategy that uses Large Language Models (LLMs) to integrate multiple behavior modalities into a "multimodal transcript."
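The "multimodal transcript" idea amounts to serializing non-verbal cues alongside the spoken words so that an off-the-shelf LLM can reason over them as text. The snippet below is only a plausible illustration of that serialization step; the cue names, timestamp format, and prompt wording are invented for the example and are not taken from the paper.

```python
def build_multimodal_transcript(utterances):
    """Serialize verbal + non-verbal cues into a single text 'transcript' for an LLM.

    `utterances` is a list of dicts with hypothetical fields; the paper's actual
    schema is not given in this summary, so the format below is illustrative.
    """
    lines = []
    for u in utterances:
        cues = ", ".join(u.get("nonverbal_cues", [])) or "none"
        lines.append(
            f"[{u['start']:.1f}s-{u['end']:.1f}s] {u['speaker']}: \"{u['text']}\" "
            f"(gaze: {u.get('gaze', 'unknown')}; non-verbal cues: {cues})"
        )
    return "\n".join(lines)


example = [
    {"start": 0.0, "end": 2.4, "speaker": "A", "text": "So what did you think?",
     "gaze": "partner", "nonverbal_cues": ["leans forward"]},
    {"start": 2.6, "end": 4.1, "speaker": "B", "text": "Hmm, I'm not sure.",
     "gaze": "away", "nonverbal_cues": ["crosses arms", "long pause before reply"]},
]

prompt = ("Rate participant B's engagement from 1 (disengaged) to 5 (engaged), "
          "given this annotated conversation:\n" + build_multimodal_transcript(example))
print(prompt)  # this prompt would then be sent to an LLM for engagement prediction
```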
arXiv Detail & Related papers (2024-09-13T18:28:12Z)
- Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents [17.299991009921307]
This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents' non-verbal behaviors, specifically gestures.
Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void.
Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis.
arXiv Detail & Related papers (2024-08-07T23:23:50Z)
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures, rendered as video rather than skeletal motion alone, can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue or conversational gestures.
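Modulating the impact of different conditioning modalities in a diffusion model is commonly done with per-modality, classifier-free-guidance-style weights applied at sampling time. The sketch below illustrates that general mechanism with hypothetical names; it is not ConvoFusion's actual guidance formulation, which this summary does not spell out.

```python
import torch

def guided_noise_estimate(model, x_t, t, audio_cond, text_cond,
                          w_audio=1.5, w_text=1.0):
    """Classifier-free-guidance-style combination of per-modality conditions.

    `model(x_t, t, audio, text)` is assumed to predict the diffusion noise, and
    passing `None` for a condition is assumed to mean "drop that modality"
    (i.e. the unconditional branch for it). Both assumptions are illustrative.
    """
    eps_uncond = model(x_t, t, None, None)          # no conditioning at all
    eps_audio = model(x_t, t, audio_cond, None)     # audio only
    eps_text = model(x_t, t, None, text_cond)       # text only
    # Each modality pushes the unconditional estimate in its own direction,
    # scaled by a user-chosen weight, so users can trade speech synchrony
    # against semantic (text) grounding when sampling gestures.
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_text * (eps_text - eps_uncond))
```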
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditional diffusion-based generative model built on the pre-trained WavLM speech model.
It can produce individual, stylized full-body co-speech gestures using only raw speech audio.
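Conditioning a generative model on raw audio typically means running the waveform through the pre-trained speech encoder and using its hidden states as the conditioning sequence. A minimal sketch with HuggingFace Transformers' WavLM is shown below; how diffmotion-v2 actually consumes these features (layer choice, pooling, diffusion architecture) is not described in this summary, so only the feature-extraction step is illustrated.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Load a pre-trained WavLM speech encoder (checkpoint name is one common choice,
# not necessarily the one used in the cited paper).
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
wavlm.eval()

# One second of 16 kHz audio; in practice this would be the recorded speech.
waveform = torch.randn(16000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    speech_feats = wavlm(**inputs).last_hidden_state  # (1, ~49 frames, 768)

# `speech_feats` would then condition the gesture generator (a diffusion model
# in the cited paper); that part is omitted here as it is model-specific.
print(speech_feats.shape)
```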
arXiv Detail & Related papers (2023-08-11T08:03:28Z)
- Human Motion Generation: A Survey [67.38982546213371]
Human motion generation aims to generate natural human pose sequences and shows immense potential for real-world applications.
Most research within this field focuses on generating human motions based on conditional signals, such as text, audio, and scene contexts.
We present a comprehensive literature review of human motion generation, which is the first of its kind in this field.
arXiv Detail & Related papers (2023-07-20T14:15:20Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of non-verbal head motions and expressions that react to multiple inputs from the speaker.
Unlike speech-driven gesture or talking-head generation, this task introduces additional input modalities, which we hope will benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z)
- Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
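The metric this entry refers to is a Fréchet-distance-style measure computed on latent gesture features (the paper's Fréchet Gesture Distance). The sketch below shows the generic form of that computation; the feature extractor (e.g. a motion autoencoder trained on human gestures) is deliberately left abstract, since its details are not given in this summary.

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussian fits to real and generated gesture features.

    `real_feats` / `gen_feats`: (N, D) arrays of latent motion features. The formula
    is the same as FID, applied to gesture features instead of image features.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    diff = mu_r - mu_g
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


# Toy usage with random "features"; real usage would first embed pose sequences.
rng = np.random.default_rng(0)
print(frechet_gesture_distance(rng.normal(size=(500, 32)),
                               rng.normal(loc=0.3, size=(500, 32))))
```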
arXiv Detail & Related papers (2020-09-04T11:42:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.