Adaptation of Tacotron2-based Text-To-Speech for
Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging
- URL: http://arxiv.org/abs/2107.12051v1
- Date: Mon, 26 Jul 2021 09:19:20 GMT
- Title: Adaptation of Tacotron2-based Text-To-Speech for
Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging
- Authors: Csaba Zainkó, László Tóth, Amin Honarmandi Shandiz, Gábor
Gosztolya, Alexandra Markó, Géza Németh, Tamás Gábor Csapó
- Abstract summary: This paper experiments with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve articulatory-to-acoustic mapping.
We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder.
- Score: 48.7576911714538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For articulatory-to-acoustic mapping, typically only limited parallel
training data is available, making it impossible to apply fully end-to-end
solutions like Tacotron2. In this paper, we experimented with transfer learning
and adaptation of a Tacotron2 text-to-speech model to improve the final
synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a
limited database. We use a multi-speaker pre-trained Tacotron2 TTS model and a
pre-trained WaveGlow neural vocoder. The articulatory-to-acoustic conversion
consists of three steps: 1) from a sequence of ultrasound tongue image recordings,
a 3D convolutional neural network predicts the inputs of the pre-trained
Tacotron2 model, 2) the Tacotron2 model converts this intermediate
representation to an 80-dimensional mel-spectrogram, and 3) the WaveGlow model
is applied to synthesize the final waveform. The generated speech retains the timing of
the original articulatory data from the ultrasound recording, but the F0
contour and the spectral information are predicted by the Tacotron2 model. The
F0 values are independent of the original ultrasound images, but represent the
target speaker, as they are inferred from the pre-trained Tacotron2 model. In
our experiments, we demonstrated that the proposed solution produces more
natural synthesized speech than our earlier model.
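To make the three-step conversion pipeline concrete, below is a minimal PyTorch sketch, not the authors' released code: the 3D CNN layer sizes, the tensor shapes, and the names UltrasoundEncoder3D, acoustic_model, and vocoder are illustrative assumptions, and the pre-trained Tacotron2 and WaveGlow models are replaced by stand-in callables.

```python
# Minimal sketch of the three-step conversion pipeline described in the abstract
# (not the authors' code). Layer sizes, shapes, and helper names are assumptions.
import torch
import torch.nn as nn

class UltrasoundEncoder3D(nn.Module):
    """Step 1: a 3D CNN that maps a sequence of ultrasound tongue images
    to the intermediate representation fed to the pre-trained Tacotron2."""
    def __init__(self, embedding_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),   # input: (batch, 1, frames, H, W)
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),           # pool space, keep the time axis
        )
        self.proj = nn.Linear(16 * 4 * 4, embedding_dim)

    def forward(self, ultrasound):
        feats = self.conv(ultrasound)                      # (batch, 16, frames, 4, 4)
        feats = feats.permute(0, 2, 1, 3, 4).flatten(2)    # (batch, frames, 256)
        return self.proj(feats)                            # (batch, frames, embedding_dim)

@torch.no_grad()
def ultrasound_to_waveform(ultrasound, encoder, acoustic_model, vocoder):
    """Chains the three steps: ultrasound frames -> Tacotron2-style inputs
    -> 80-dimensional mel-spectrogram -> waveform."""
    intermediate = encoder(ultrasound)        # step 1: 3D CNN prediction
    mel = acoustic_model(intermediate)        # step 2: adapted pre-trained Tacotron2
    return vocoder(mel)                       # step 3: pre-trained WaveGlow inference

# Smoke test with placeholder callables standing in for the pre-trained models.
encoder = UltrasoundEncoder3D()
acoustic_model = lambda x: torch.randn(x.size(0), 80, 4 * x.size(1))  # placeholder mel output
vocoder = lambda mel: torch.randn(mel.size(0), mel.size(2) * 256)     # placeholder waveform
audio = ultrasound_to_waveform(torch.randn(1, 1, 8, 64, 128), encoder, acoustic_model, vocoder)
```

In the actual system, acoustic_model would be the adapted multi-speaker Tacotron2 producing the 80-dimensional mel-spectrogram, and vocoder would be WaveGlow inference.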
Related papers
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency.
Experiments show that our model achieves $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis [6.509758931804479]
We propose a feed-forward Transformer based TTS model that is designed based on the source-filter theory.
FastPitchFormant has a unique structure that handles text and acoustic features in parallel.
arXiv Detail & Related papers (2021-06-29T07:06:42Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.