A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
- URL: http://arxiv.org/abs/2301.05339v4
- Date: Mon, 10 Apr 2023 09:11:59 GMT
- Title: A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
- Authors: Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje
  Henter, Michael Neff
- Abstract summary: The automatic generation of co-speech gestures is a long-standing problem in computer animation.
Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion.
This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models.
- Score: 11.948557523215316
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract:   Gestures that accompany speech are an essential part of natural and efficient
embodied human communication. The automatic generation of such co-speech
gestures is a long-standing problem in computer animation and is considered an
enabling technology in film, games, virtual social spaces, and for interaction
with social robots. The problem is made challenging by the idiosyncratic and
non-periodic nature of human co-speech gesture motion, and by the great
diversity of communicative functions that gestures encompass. Gesture
generation has seen surging interest recently, owing to the emergence of more
and larger datasets of human gesture motion, combined with strides in
deep-learning-based generative models that benefit from the growing
availability of data. This review article summarizes co-speech gesture
generation research, with a particular focus on deep generative models. First,
we articulate the theory describing human gesticulation and how it complements
speech. Next, we briefly discuss rule-based and classical statistical gesture
synthesis, before delving into deep learning approaches. We employ the choice
of input modalities as an organizing principle, examining systems that generate
gestures from audio, text, and non-linguistic input. We also chronicle the
evolution of the related training data sets in terms of size, diversity, motion
quality, and collection method. Finally, we identify key research challenges in
gesture generation, including data availability and quality; producing
human-like motion; grounding the gesture in the co-occurring speech, in
interaction with other speakers, and in the environment; performing gesture
evaluation; and integration of gesture synthesis into applications. We
highlight recent approaches to tackling the various key challenges, as well as
the limitations of these approaches, and point toward areas of future
development.
 
      
Related papers
- Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
 We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage.
This dataset enables the development of AI technologies that understand dyadic embodied dynamics.
We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
 arXiv (2025-06-27T18:09:49Z)
- Large Language Models for Virtual Human Gesture Selection [0.3749861135832072]
 Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions.
We use the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures.
 arXiv (2025-03-18T16:49:56Z)
- HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures [8.50717565369252]
 HoloGest is a novel neural network framework for automatic generation of high-quality, expressive co-speech gestures.
Our system learns a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements.
Our model achieves a level of realism close to the ground truth, providing an immersive user experience.
 arXiv (2025-03-17T14:42:31Z)
- HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation [42.30003982604611]
 Co-speech gestures are crucial non-verbal cues that enhance speech clarity in human communication.
We propose a novel method named HOP for co-speech gesture generation, capturing heterogeneous entanglement between gesture motion, audio rhythm, and text semantics.
HOP achieves state-of-the-art performance, offering more natural and expressive co-speech gesture generation.
 arXiv (2025-03-03T04:47:39Z)
- Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation [70.52558242336988]
 We focus on predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation.
We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a "multimodal transcript".
 arXiv (2024-09-13T18:28:12Z)
- Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents [17.299991009921307]
 This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents' non-verbal behaviors, specifically gestures.
Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void.
Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis.
 arXiv (2024-08-07T23:23:50Z)
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
 We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
 arXiv (2024-06-26T04:53:11Z)
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
 Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
 arXiv (2024-04-02T11:40:34Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
 We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
 arXiv (2024-03-26T17:59:52Z)
- Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
 diffmotion-v2 is a speech-conditional, diffusion-based generative model built on the WavLM pre-trained model.
It can produce individual and stylized full-body co-speech gestures using only raw speech audio.
 arXiv (2023-08-11T08:03:28Z)
- Human Motion Generation: A Survey [67.38982546213371]
 Human motion generation aims to generate natural human pose sequences and shows immense potential for real-world applications.
Most research within this field focuses on generating human motions based on conditional signals, such as text, audio, and scene contexts.
We present a comprehensive literature review of human motion generation, which is the first of its kind in this field.
 arXiv (2023-07-20T14:15:20Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
 We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
 arXiv (2022-03-24T16:33:29Z)
- Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
 We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to the multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields.
 arXiv (2021-12-27T07:18:50Z)
- Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
 We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
 arXiv (2020-09-04T11:42:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     