Unified speech and gesture synthesis using flow matching
- URL: http://arxiv.org/abs/2310.05181v2
- Date: Tue, 9 Jan 2024 21:23:54 GMT
- Title: Unified speech and gesture synthesis using flow matching
- Authors: Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter
- Abstract summary: This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text.
The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures.
- Score: 24.2094371314481
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As text-to-speech technologies achieve remarkable naturalness in read-aloud
tasks, there is growing interest in multimodal synthesis of verbal and
non-verbal communicative behaviour, such as spontaneous speech and associated
body gestures. This paper presents a novel, unified architecture for jointly
synthesising speech acoustics and skeleton-based 3D gesture motion from text,
trained using optimal-transport conditional flow matching (OT-CFM). The
proposed architecture is simpler than the previous state of the art, has a
smaller memory footprint, and can capture the joint distribution of speech and
gestures, generating both modalities together in one single process. The new
training regime, meanwhile, enables better synthesis quality in much fewer
steps (network evaluations) than before. Uni- and multimodal subjective tests
demonstrate improved speech naturalness, gesture human-likeness, and
cross-modal appropriateness compared to existing benchmarks. Please see
https://shivammehta25.github.io/Match-TTSG/ for video examples and code.
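To make the training objective mentioned in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of one OT-CFM training step for jointly modelled speech and gesture features. It assumes a decoder network v_theta that predicts a vector field over the concatenated acoustic and motion features; the names (v_theta, cond, sigma_min) are illustrative and are not taken from the Match-TTSG code release.
```python
# Hypothetical sketch of one OT-CFM training step (not the authors' released code).
import torch

def ot_cfm_loss(v_theta, x1, cond, sigma_min=1e-4):
    """x1: target features (batch, frames, n_mels + n_motion); cond: text conditioning."""
    b = x1.shape[0]
    t = torch.rand(b, 1, 1, device=x1.device)        # random flow time in [0, 1]
    x0 = torch.randn_like(x1)                        # noise sample at t = 0
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1    # straight-line (OT) interpolant
    u_t = x1 - (1 - sigma_min) * x0                  # target conditional vector field
    v_pred = v_theta(x_t, t.view(b), cond)           # predicted vector field
    return torch.mean((v_pred - u_t) ** 2)           # flow-matching regression loss
```
At synthesis time the learned vector field is integrated from Gaussian noise with an ODE solver, which is why good quality can be reached in only a handful of network evaluations.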
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach based on low-rank adaptation (LoRA) of a large language model (LLM) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons [16.52004713662265]
We present a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons.
We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention.
Experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness.
arXiv Detail & Related papers (2023-09-13T16:07:25Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
- QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation [8.604430209445695]
Speech-driven gesture generation is highly challenging due to the random jitters of human motion.
We introduce a novel quantization-based and phase-guided motion-matching framework.
Our method outperforms recent approaches on speech-driven gesture generation.
arXiv Detail & Related papers (2023-05-18T16:31:25Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Integrated Speech and Gesture Synthesis [26.267738299876314]
Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities.
We propose to synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG).
The model achieves this with faster synthesis and a greatly reduced parameter count compared to the pipeline system (a sketch of the joint-output idea appears after this entry).
arXiv Detail & Related papers (2021-08-25T19:04:00Z)
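As an illustration of the ISG idea of producing both modalities from one network (rather than a TTS-then-gesture pipeline), here is a short, hypothetical sketch of a shared output head whose channels are split into acoustic and motion features. The class and argument names are invented for illustration and do not come from the ISG paper.
```python
# Hypothetical joint output head: one projection emits speech and gesture features together.
import torch.nn as nn

class JointSpeechGestureHead(nn.Module):
    def __init__(self, d_model=256, n_mels=80, n_motion=45):
        super().__init__()
        self.n_mels = n_mels
        self.proj = nn.Linear(d_model, n_mels + n_motion)  # single projection, two modalities

    def forward(self, decoder_states):                     # (batch, frames, d_model)
        out = self.proj(decoder_states)
        mel = out[..., :self.n_mels]                        # acoustic features
        motion = out[..., self.n_mels:]                     # skeleton motion features
        return mel, motion
```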
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.