Generating coherent spontaneous speech and gesture from text
- URL: http://arxiv.org/abs/2101.05684v1
- Date: Thu, 14 Jan 2021 16:02:21 GMT
- Title: Generating coherent spontaneous speech and gesture from text
- Authors: Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow
- Abstract summary: Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements).
Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data.
We put these two state-of-the-art technologies together in a coherent fashion for the first time.
- Score: 21.90157862281996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied human communication encompasses both verbal (speech) and non-verbal
information (e.g., gesture and head movements). Recent advances in machine
learning have substantially improved the technologies for generating synthetic
versions of both of these types of data: On the speech side, text-to-speech
systems are now able to generate highly convincing, spontaneous-sounding speech
using unscripted speech audio as the source material. On the motion side,
probabilistic motion-generation methods can now synthesise vivid and lifelike
speech-driven 3D gesticulation. In this paper, we put these two
state-of-the-art technologies together in a coherent fashion for the first
time. Concretely, we demonstrate a proof-of-concept system trained on a
single-speaker audio and motion-capture dataset that is able to generate both
speech and full-body gestures together from text input. In contrast to previous
approaches for joint speech-and-gesture generation, we generate full-body
gestures from speech synthesis trained on recordings of spontaneous speech from
the same person as the motion-capture data. We illustrate our results by
visualising gesture spaces and text-speech-gesture alignments, and through a
demonstration video at https://simonalexanderson.github.io/IVA2020 .
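The abstract describes a two-stage pipeline: a text-to-speech model trained on unscripted speech from a single speaker produces the audio, and a speech-driven motion model trained on motion capture of the same speaker turns that audio into full-body gestures. The sketch below illustrates only this data flow under stated assumptions; every class and function name (SpontaneousTTS, GestureGenerator, text_to_speech_and_gesture) is a hypothetical placeholder, not the authors' code or any specific library API.

```python
# Minimal sketch of a text -> speech -> gesture pipeline in the spirit of the
# paper. All model classes are hypothetical stand-ins: the TTS stage returns a
# waveform plus per-word timing, and the gesture stage maps that audio to
# joint-angle frames.

from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class SpeechOutput:
    waveform: np.ndarray                           # mono audio samples
    sample_rate: int
    word_timings: List[Tuple[str, float, float]]   # (word, start_s, end_s)


class SpontaneousTTS:
    """Placeholder for a TTS model trained on unscripted speech audio."""

    def synthesize(self, text: str) -> SpeechOutput:
        # Dummy output: one second of silence with naive uniform word timings.
        sr = 22050
        words = text.split()
        dur = 1.0
        step = dur / max(len(words), 1)
        timings = [(w, i * step, (i + 1) * step) for i, w in enumerate(words)]
        return SpeechOutput(np.zeros(int(sr * dur)), sr, timings)


class GestureGenerator:
    """Placeholder for a probabilistic speech-driven gesture model."""

    def __init__(self, n_joints: int = 45, fps: int = 60):
        self.n_joints = n_joints
        self.fps = fps

    def generate(self, speech: SpeechOutput) -> np.ndarray:
        # One pose (joint-angle vector) per output frame; random here.
        n_frames = int(len(speech.waveform) / speech.sample_rate * self.fps)
        return np.random.default_rng(0).normal(size=(n_frames, self.n_joints))


def text_to_speech_and_gesture(text: str) -> Tuple[SpeechOutput, np.ndarray]:
    """Chain the two stages: the gesture model consumes the synthesized audio."""
    speech = SpontaneousTTS().synthesize(text)
    motion = GestureGenerator().generate(speech)
    return speech, motion


if __name__ == "__main__":
    speech, motion = text_to_speech_and_gesture("well I mean it sort of depends")
    print(f"{len(speech.waveform)} audio samples, {motion.shape} pose frames")
```

Because both stages are trained on recordings of the same person, the gesture model receives synthetic speech whose prosody resembles its training input; that coupling, not the placeholder models above, is the paper's contribution.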
Related papers
- Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which encode prior knowledge, to generalize the visual domain to unseen data.
arXiv Detail & Related papers (2024-09-23T20:19:24Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Neural Text to Articulate Talk: Deep Text to Audiovisual Speech
Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation [46.8780140220063]
We present a joint audio-text model to capture contextual information for expressive speech-driven 3D facial animation.
Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio.
We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization.
arXiv Detail & Related papers (2021-12-04T01:37:22Z) - AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary
Person [21.126759304401627]
We present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
arXiv Detail & Related papers (2021-08-09T19:58:38Z) - Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z) - Gesticulator: A framework for semantically-aware speech-driven gesture
generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output (a minimal input/output sketch follows this list).
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.