Integrated Speech and Gesture Synthesis
- URL: http://arxiv.org/abs/2108.11436v1
- Date: Wed, 25 Aug 2021 19:04:00 GMT
- Title: Integrated Speech and Gesture Synthesis
- Authors: Siyang Wang, Simon Alexanderson, Joakim Gustafson, Jonas Beskow,
Gustav Eje Henter, Éva Székely
- Abstract summary: Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities.
We propose to synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG).
The model achieves this with a faster synthesis time and a greatly reduced parameter count compared to the pipeline system.
- Score: 26.267738299876314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-speech and co-speech gesture synthesis have until now been treated as
separate areas by two different research communities, and applications merely
stack the two technologies using a simple system-level pipeline. This can lead
to modeling inefficiencies and may introduce inconsistencies that limit the
achievable naturalness. We propose to instead synthesize the two modalities in
a single model, a new problem we call integrated speech and gesture synthesis
(ISG). We also propose a set of models modified from state-of-the-art neural
speech-synthesis engines to achieve this goal. We evaluate the models in three
carefully-designed user studies, two of which evaluate the synthesized speech
and gesture in isolation, plus a combined study that evaluates the models as
they will be used in real-world applications -- speech and gesture presented
together. The results show that participants rate one of the proposed
integrated synthesis models as being as good as the state-of-the-art pipeline
system we compare against, in all three tests. The model is able to achieve
this with faster synthesis time and greatly reduced parameter count compared to
the pipeline system, illustrating some of the potential benefits of treating
speech and gesture synthesis together as a single, unified problem. Videos and
code are available on our project page at https://swatsw.github.io/isg_icmi21/
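To make the contrast in the abstract concrete, here is a minimal sketch (not the authors' actual architecture) of the two setups: a system-level pipeline that chains a TTS model and a separate gesture generator, versus a single integrated model whose shared encoder drives both a speech head and a gesture head. All module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class PipelineISG:
    """Pipeline baseline: two independently trained models, chained at the system level."""
    def __init__(self, tts: nn.Module, gesture_model: nn.Module):
        self.tts = tts                       # text -> acoustic features (e.g. mel-spectrogram)
        self.gesture_model = gesture_model   # acoustic features -> 3D motion parameters

    def synthesise(self, text_ids: torch.Tensor):
        mel = self.tts(text_ids)             # first stage
        motion = self.gesture_model(mel)     # second stage consumes first-stage output
        return mel, motion

class IntegratedISG(nn.Module):
    """Integrated model: one encoder, two output heads trained jointly on parallel data."""
    def __init__(self, vocab_size=100, d_model=256, n_mels=80, n_joints=45):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.speech_head = nn.Linear(d_model, n_mels)     # acoustic frames
        self.gesture_head = nn.Linear(d_model, n_joints)  # pose parameters per frame

    def forward(self, text_ids: torch.Tensor):
        h, _ = self.encoder(self.embed(text_ids))
        return self.speech_head(h), self.gesture_head(h)  # both modalities from one pass

model = IntegratedISG()
mel, motion = model(torch.randint(0, 100, (1, 20)))
print(mel.shape, motion.shape)  # torch.Size([1, 20, 80]) torch.Size([1, 20, 45])
```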
Related papers
- Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis [21.210982054134686]
Joint and unified synthesis of speech audio and co-speech 3D gesture motion from text is a new and emerging field.
Existing methods are trained on parallel data from all constituent modalities.
Inspired by student-teacher methods, we propose a straightforward solution to the data shortage: simply synthesising additional training material.
arXiv Detail & Related papers (2024-04-30T15:22:19Z)
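A minimal sketch of the synthetic-data idea in the "Fake it to make it" entry above, assuming hypothetical pretrained teacher models: an existing TTS teacher and a speech-to-gesture teacher synthesise extra parallel speech-and-gesture examples from plain text, and the joint student model is then trained on a mix of real and synthetic pairs.

```python
import torch

def build_synthetic_corpus(texts, tts_teacher, gesture_teacher):
    """Turn plain text into (text, speech, motion) triples using pretrained teachers."""
    corpus = []
    for text_ids in texts:
        with torch.no_grad():
            mel = tts_teacher(text_ids)      # synthetic speech features
            motion = gesture_teacher(mel)    # synthetic co-speech motion
        corpus.append((text_ids, mel, motion))
    return corpus

def mixed_training_set(real_pairs, synthetic_pairs, synthetic_ratio=0.5):
    """Combine real parallel data with a chosen amount of synthetic material."""
    n_syn = int(len(real_pairs) * synthetic_ratio)
    return real_pairs + synthetic_pairs[:n_syn]
```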
- Unified speech and gesture synthesis using flow matching [24.2094371314481]
This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text.
The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures.
arXiv Detail & Related papers (2023-10-08T14:37:28Z)
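For context on the flow-matching entry above, the sketch below shows a generic conditional flow-matching training step over a joint speech-plus-gesture feature vector: sample a time t, interpolate between noise and data, and regress the network's predicted velocity onto the straight-line target. This is the standard formulation, not necessarily the exact variant used in that paper; the network signature is a hypothetical placeholder.

```python
import torch
import torch.nn as nn

def flow_matching_loss(v_net: nn.Module, x1: torch.Tensor, cond: torch.Tensor):
    """One conditional flow-matching step.
    x1: real joint features, e.g. concatenated [mel ; pose] frames, shape (B, T, D).
    cond: conditioning (e.g. encoded text)."""
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.size(0), 1, 1)          # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # point on the linear interpolation path
    target_velocity = x1 - x0                 # time derivative of that path
    pred_velocity = v_net(xt, t, cond)        # hypothetical model signature
    return ((pred_velocity - target_velocity) ** 2).mean()
```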
- ORES: Open-vocabulary Responsible Visual Synthesis [104.7572323359984]
We formalize a new task, Open-vocabulary Responsible Visual Synthesis (ORES), where the synthesis model is able to avoid forbidden visual concepts.
To address this problem, we present a Two-stage Intervention (TIN) framework.
By introducing 1) rewriting with learnable instruction through a large language model (LLM) and 2) synthesizing with prompt intervention on a diffusion model, it can effectively synthesize images that avoid forbidden concepts while following the user's query as closely as possible.
arXiv Detail & Related papers (2023-08-26T06:47:34Z)
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
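In the same spirit, the diffusion-based Diff-TTSG entry above can be illustrated with a generic DDPM-style noise-prediction loss over joint speech-and-gesture features; the noise schedule and network below are hypothetical stand-ins, not Diff-TTSG's actual components.

```python
import torch
import torch.nn as nn

def diffusion_loss(eps_net: nn.Module, x0: torch.Tensor, cond: torch.Tensor,
                   alphas_cumprod: torch.Tensor):
    """Standard noise-prediction objective on joint [speech ; gesture] features.
    x0: clean features (B, T, D); alphas_cumprod: (num_steps,) cumulative noise schedule."""
    b = x0.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (b,))    # random diffusion step per example
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    noise = torch.randn_like(x0)
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward (noising) process
    return ((eps_net(xt, t, cond) - noise) ** 2).mean()   # predict the added noise
```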
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
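As a reminder of what vector quantisation means in the entry above: continuous speech features are snapped to their nearest codebook entries, and the model then works with (or predicts) the resulting discrete codes. The sketch below is a simplified quantiser; the codebook size and feature dimension are arbitrary.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup (simplified; no commitment loss shown)."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (B, T, dim) continuous features -> discrete indices + quantised vectors
        codebook = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dists = torch.cdist(z, codebook)       # (B, T, num_codes) pairwise distances
        codes = dists.argmin(dim=-1)           # (B, T) discrete token ids
        return codes, self.codebook(codes)     # quantised features

vq = VectorQuantizer()
codes, zq = vq(torch.randn(2, 100, 64))
print(codes.shape, zq.shape)  # torch.Size([2, 100]) torch.Size([2, 100, 64])
```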
- Novel-View Acoustic Synthesis [140.1107768313269]
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint?
We propose a neural rendering approach, the Visually-Guided Acoustic Synthesis (ViGAS) network, which learns to synthesize the sound at an arbitrary point in space.
arXiv Detail & Related papers (2023-01-20T18:49:58Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence has been predicted, each target speech signal can be re-synthesized by feeding those symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
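A schematic of the separate-by-resynthesis idea from the entry above, using hypothetical placeholder networks: a separator classifies each frame of the mixture into one discrete symbol sequence per target speaker, and a synthesiser then regenerates each speaker's speech from its own symbols.

```python
import torch
import torch.nn as nn

class SymbolSeparator(nn.Module):
    """Predicts one discrete symbol sequence per target speaker from a mixture."""
    def __init__(self, n_mels=80, hidden=256, vocab=512, num_speakers=2):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.heads = nn.ModuleList([nn.Linear(hidden, vocab) for _ in range(num_speakers)])

    def forward(self, mixture_mel: torch.Tensor):
        h, _ = self.rnn(mixture_mel)
        return [head(h).argmax(dim=-1) for head in self.heads]   # list of (B, T) symbol ids

def separate_by_resynthesis(mixture_mel, separator, synthesiser):
    """Recognise discrete symbols, then re-synthesise each target speech signal."""
    symbol_seqs = separator(mixture_mel)
    return [synthesiser(symbols) for symbols in symbol_seqs]     # one signal per speaker
```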
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
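To make the pruning result above concrete, this is how unstructured magnitude pruning is typically applied to a trained module in PyTorch; the layer, sparsity level, and criterion here are illustrative only, not the paper's exact setup.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical stand-in for one layer of a trained TTS model.
layer = nn.Linear(512, 512)

# Zero out the 80% of weights with the smallest magnitude (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.8)

# Make the pruning permanent (removes the reparametrisation, keeps the zeros).
prune.remove(layer, "weight")

sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()
print(f"sparsity: {sparsity:.2f}")  # ~0.80
```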
- On-device neural speech synthesis [3.716815259884143]
Tacotron and WaveRNN have made it possible to construct a fully neural-network-based TTS system.
We present key modeling improvements and optimization strategies that enable deploying these models on GPU servers and on mobile devices.
The proposed system can generate high-quality 24 kHz speech 5x faster than real time on servers and 3x faster than real time on mobile devices.
arXiv Detail & Related papers (2021-09-17T18:31:31Z)
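For reference, "5x faster than real time" in the entry above corresponds to a real-time factor (RTF) of about 0.2, i.e. synthesis wall-clock time divided by the duration of the generated audio. A tiny helper, assuming a hypothetical synthesis function that returns raw samples:

```python
import time

def real_time_factor(synthesis_fn, text, sample_rate=24_000):
    """RTF = synthesis wall-clock time / duration of generated audio.
    RTF < 1 means faster than real time; 0.2 means 5x faster than real time."""
    start = time.perf_counter()
    waveform = synthesis_fn(text)             # hypothetical TTS call returning samples
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return elapsed / audio_seconds
```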
- Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning [6.514358246805895]
We propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system.
We leverage transfer learning by training a deep learning model to generate both speech and laughs from annotations.
arXiv Detail & Related papers (2020-08-20T09:37:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.