Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents
- URL: http://arxiv.org/abs/2309.09346v1
- Date: Sun, 17 Sep 2023 18:46:25 GMT
- Title: Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents
- Authors: Carson Yu Liu, Gelareh Mohammadi, Yang Song and Wafa Johal
- Abstract summary: Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread.
We propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances.
- Score: 5.244401764969407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied agents, in the form of virtual agents or social robots, are rapidly
becoming more widespread. In human-human interactions, humans use nonverbal
behaviours to convey their attitudes, feelings, and intentions. Therefore, this
capability is also required for embodied agents in order to enhance the quality
and effectiveness of their interactions with humans. In this paper, we propose
a novel framework that can generate sequences of joint angles from the speech
text and speech audio utterances. Based on a conditional Generative Adversarial
Network (GAN), our proposed neural network model learns the relationships
between the co-speech gestures and both semantic and acoustic features from the
speech input. In order to train our neural network model, we employ a public
dataset containing co-speech gestures with corresponding speech audio
utterances, which were captured from a single male native English speaker. The
results from both objective and subjective evaluations demonstrate the efficacy
of our gesture-generation framework for Robots and Embodied Agents.
Related papers
- Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction [23.115506530649988]
PerceptiveAgent is an empathetic multi-modal dialogue system designed to discern deeper or more subtle meanings.
PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language.
arXiv Detail & Related papers (2024-06-18T15:19:51Z) - Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task [17.190635800969456]
Intention-based Human-Robot Interaction (HRI) systems allow robots to perceive and interpret user actions.
We introduce a hierarchical approach for interpreting user non-verbal cues, like hand gestures, body poses, and facial expressions.
Our evaluation demonstrates the potential of LLMs to interpret non-verbal cues and to combine them with their context-understanding capabilities.
arXiv Detail & Related papers (2024-04-12T12:15:14Z) - Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio
Representation [18.04996323708772]
This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023.
We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture.
The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model.
arXiv Detail & Related papers (2023-09-11T13:51:06Z) - Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods in a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to the multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modals in this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z) - Few-shot Language Coordination by Modeling Theory of Mind [95.54446989205117]
We study the task of few-shot $textitlanguage coordination$.
We require the lead agent to coordinate with a $textitpopulation$ of agents with different linguistic abilities.
This requires the ability to model the partner's beliefs, a vital component of human communication.
arXiv Detail & Related papers (2021-07-12T19:26:11Z) - Passing a Non-verbal Turing Test: Evaluating Gesture Animations
Generated from Speech [6.445605125467574]
In this paper, we propose a novel, data-driven technique for generating gestures directly from speech.
Our approach is based on the application of Generative Adversarial Neural Networks (GANs) to model the correlation rather than causation between speech and gestures.
For the study, we animate the generated gestures on a virtual character. We find that users are not able to distinguish between the generated and the recorded gestures.
arXiv Detail & Related papers (2021-07-01T19:38:43Z) - Self-supervised reinforcement learning for speaker localisation with the
iCub humanoid robot [58.2026611111328]
Looking at a person's face is one of the mechanisms that humans rely on when it comes to filtering speech in noisy environments.
Having a robot that can look toward a speaker could benefit ASR performance in challenging environments.
We propose a self-supervised reinforcement learning-based framework inspired by the early development of humans.
arXiv Detail & Related papers (2020-11-12T18:02:15Z) - Gesticulator: A framework for semantically-aware speech-driven gesture
generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.