Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents
- URL: http://arxiv.org/abs/2309.09346v1
- Date: Sun, 17 Sep 2023 18:46:25 GMT
- Title: Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents
- Authors: Carson Yu Liu, Gelareh Mohammadi, Yang Song and Wafa Johal
- Abstract summary: Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread.
We propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances.
- Score: 5.244401764969407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied agents, in the form of virtual agents or social robots, are rapidly
becoming more widespread. In human-human interactions, humans use nonverbal
behaviours to convey their attitudes, feelings, and intentions. Therefore, this
capability is also required for embodied agents in order to enhance the quality
and effectiveness of their interactions with humans. In this paper, we propose
a novel framework that can generate sequences of joint angles from the speech
text and speech audio utterances. Based on a conditional Generative Adversarial
Network (GAN), our proposed neural network model learns the relationships
between the co-speech gestures and both semantic and acoustic features from the
speech input. In order to train our neural network model, we employ a public
dataset containing co-speech gestures with corresponding speech audio
utterances, which were captured from a single male native English speaker. The
results from both objective and subjective evaluations demonstrate the efficacy
of our gesture-generation framework for Robots and Embodied Agents.
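For intuition only, the sketch below shows one way a conditional GAN of this kind could be wired up in PyTorch: a recurrent generator maps per-frame semantic (text) and acoustic (audio) speech features plus noise to a joint-angle sequence, and a discriminator scores gesture sequences conditioned on the same speech features. Every dimension, layer choice, and helper name here is an illustrative assumption, not the authors' actual architecture.

```python
# Hedged sketch of a conditional GAN for co-speech gesture generation.
# Feature sizes and network choices below are assumptions for illustration.
import torch
import torch.nn as nn

SEMANTIC_DIM = 300   # assumed per-frame text/word embedding size
ACOUSTIC_DIM = 64    # assumed per-frame audio feature size (e.g. mel bands)
JOINT_DIM = 30       # assumed number of joint angles per frame
NOISE_DIM = 16
HIDDEN = 256


class Generator(nn.Module):
    """Maps (speech features + noise) to a joint-angle sequence."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(SEMANTIC_DIM + ACOUSTIC_DIM + NOISE_DIM,
                          HIDDEN, num_layers=2, batch_first=True)
        self.out = nn.Linear(HIDDEN, JOINT_DIM)

    def forward(self, semantic, acoustic, noise):
        x = torch.cat([semantic, acoustic, noise], dim=-1)  # (B, T, D)
        h, _ = self.rnn(x)
        return self.out(h)                                  # (B, T, JOINT_DIM)


class Discriminator(nn.Module):
    """Scores a gesture sequence conditioned on the speech features."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(SEMANTIC_DIM + ACOUSTIC_DIM + JOINT_DIM,
                          HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, 1)

    def forward(self, semantic, acoustic, gestures):
        x = torch.cat([semantic, acoustic, gestures], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h[:, -1])                           # (B, 1) logit


def train_step(gen, disc, opt_g, opt_d, semantic, acoustic, real_gestures):
    """One adversarial update with the standard non-saturating GAN loss."""
    bce = nn.BCEWithLogitsLoss()
    B, T = semantic.shape[:2]
    noise = torch.randn(B, T, NOISE_DIM)

    # Discriminator update: real vs. generated gestures for the same speech.
    fake = gen(semantic, acoustic, noise).detach()
    d_real = disc(semantic, acoustic, real_gestures)
    d_fake = disc(semantic, acoustic, fake)
    d_loss = (bce(d_real, torch.ones_like(d_real))
              + bce(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: fool the discriminator on freshly generated gestures.
    fake = gen(semantic, acoustic, noise)
    g_score = disc(semantic, acoustic, fake)
    g_loss = bce(g_score, torch.ones_like(g_score))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

In this kind of setup the discriminator sees the speech features alongside the gestures, so it penalises motion that is plausible in isolation but mismatched to the utterance; that is the "conditional" part of the conditional GAN described in the abstract.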
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - SIFToM: Robust Spoken Instruction Following through Theory of Mind [51.326266354164716]
We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), to enable robots to pragmatically follow human instructions under diverse speech conditions.
Results show that the SIFToM model outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks.
arXiv Detail & Related papers (2024-09-17T02:36:10Z) - Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction [23.115506530649988]
PerceptiveAgent is an empathetic multi-modal dialogue system designed to discern deeper or more subtle meanings.
PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language.
arXiv Detail & Related papers (2024-06-18T15:19:51Z) - Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation [18.04996323708772]
This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023.
We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture (a minimal sketch of this contrastive-embedding idea follows the list below).
The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model.
arXiv Detail & Related papers (2023-09-11T13:51:06Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z) - Few-shot Language Coordination by Modeling Theory of Mind [95.54446989205117]
We study the task of few-shot language coordination.
We require the lead agent to coordinate with a population of agents with different linguistic abilities.
This requires the ability to model the partner's beliefs, a vital component of human communication.
arXiv Detail & Related papers (2021-07-12T19:26:11Z) - Passing a Non-verbal Turing Test: Evaluating Gesture Animations Generated from Speech [6.445605125467574]
In this paper, we propose a novel, data-driven technique for generating gestures directly from speech.
Our approach is based on the application of Generative Adversarial Neural Networks (GANs) to model the correlation rather than causation between speech and gestures.
For the study, we animate the generated gestures on a virtual character. We find that users are not able to distinguish between the generated and the recorded gestures.
arXiv Detail & Related papers (2021-07-01T19:38:43Z) - Self-supervised reinforcement learning for speaker localisation with the iCub humanoid robot [58.2026611111328]
Looking at a person's face is one of the mechanisms that humans rely on when it comes to filtering speech in noisy environments.
Having a robot that can look toward a speaker could benefit ASR performance in challenging environments.
We propose a self-supervised reinforcement learning-based framework inspired by the early development of humans.
arXiv Detail & Related papers (2020-11-12T18:02:15Z) - Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z) - Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
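As a companion to the CSMP entry above, the following is a minimal, hedged sketch of contrastive speech-and-motion pretraining: paired speech and gesture windows are encoded into a shared embedding space with a symmetric InfoNCE loss, and the speech embedding can then serve as the conditioning signal of a downstream gesture-synthesis model (e.g. a diffusion model). Encoder architectures, dimensions, and names are illustrative assumptions, not the cited system.

```python
# Hedged sketch of CSMP-style contrastive speech-and-motion pretraining.
# Encoders and feature sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WindowEncoder(nn.Module):
    """Encodes a (B, T, D) feature window into a single (B, E) embedding."""

    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(in_dim, emb_dim, batch_first=True)

    def forward(self, x):
        _, h = self.rnn(x)                 # h: (1, B, E)
        return F.normalize(h[-1], dim=-1)  # unit-norm embedding


def contrastive_loss(speech_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired speech/motion windows."""
    logits = speech_emb @ motion_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0))               # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Example: pretrain on paired windows; the speech embedding can later be fed
# as the conditioning vector of a separate gesture generator.
speech_enc = WindowEncoder(in_dim=64)   # assumed audio feature size
motion_enc = WindowEncoder(in_dim=30)   # assumed joint-angle dimension
speech = torch.randn(8, 100, 64)
motion = torch.randn(8, 100, 30)
loss = contrastive_loss(speech_enc(speech), motion_enc(motion))
loss.backward()
```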