FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis
- URL: http://arxiv.org/abs/2106.15123v1
- Date: Tue, 29 Jun 2021 07:06:42 GMT
- Title: FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis
- Authors: Taejun Bak, Jae-Sung Bae, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho
- Abstract summary: We propose a feed-forward Transformer-based TTS model designed according to source-filter theory.
FastPitchFormant has a unique structure that handles text and acoustic features in parallel.
- Score: 6.509758931804479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models, and prosodic speech can be generated by conditioning on those acoustic features. However, speech synthesized with a large pitch-shift scale suffers from audio quality degradation and deformation of speaker characteristics. To address this problem, we propose a feed-forward Transformer-based TTS model designed according to source-filter theory. This model, called FastPitchFormant, has a unique structure that handles text and acoustic features in parallel. By modeling each feature separately, the model's tendency to entangle the relationship between the two features can be mitigated.
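The parallel decomposition can be pictured as two branches that meet only at the output. Below is a minimal PyTorch sketch of that structure; the module granularity, names (formant_generator, excitation_generator), and dimensions are illustrative assumptions, not the paper's exact layers.

```python
# Minimal sketch of the parallel source-filter structure. The phoneme sequence
# is assumed to be already expanded to frame length by a duration regulator,
# as in FastPitch. All sizes and module choices are assumptions.
import torch
import torch.nn as nn

class FastPitchFormantSketch(nn.Module):
    def __init__(self, n_symbols=256, d_model=256, n_mels=80):
        super().__init__()
        self.text_embed = nn.Embedding(n_symbols, d_model)
        self.pitch_embed = nn.Linear(1, d_model)  # embeds the (possibly shifted) F0 contour
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Two parallel branches: "filter" driven by text, "source" driven by pitch.
        self.formant_generator = nn.TransformerEncoder(layer(), num_layers=2)
        self.excitation_generator = nn.TransformerEncoder(layer(), num_layers=2)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phonemes, f0):
        # phonemes: (B, T) frame-aligned phoneme ids; f0: (B, T) frame-level pitch
        formant = self.formant_generator(self.text_embed(phonemes))
        excitation = self.excitation_generator(self.pitch_embed(f0.unsqueeze(-1)))
        # The branches are fused only here, so scaling f0 perturbs the
        # excitation path without deforming the text-driven formant path.
        return self.to_mel(formant + excitation)
```

In this reading, a pitch shift amounts to scaling f0 before the excitation branch, which is exactly the operating point where entangled models degrade.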
Related papers
- NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
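The factorization can be illustrated as one codebook per attribute subspace; a hedged PyTorch sketch follows, where the subspace dimension, codebook sizes, and the straight-through trick are assumptions rather than details quoted from the paper.

```python
# Illustrative factorized vector quantization: each attribute gets its own
# projection and codebook. All sizes here are assumptions for the sketch.
import torch
import torch.nn as nn

class FactorizedVQ(nn.Module):
    def __init__(self, d_in=512, d_sub=128, codebook_size=1024,
                 attributes=("content", "prosody", "timbre", "detail")):
        super().__init__()
        self.proj = nn.ModuleDict({a: nn.Linear(d_in, d_sub) for a in attributes})
        self.book = nn.ModuleDict({a: nn.Embedding(codebook_size, d_sub) for a in attributes})

    def forward(self, z):
        # z: (B, T, d_in) frame-level encoder output
        quantized = {}
        for name in self.proj:
            zs = self.proj[name](z)                            # project into the subspace
            codes = self.book[name].weight                     # (K, d_sub)
            dist = (zs.unsqueeze(-2) - codes).pow(2).sum(-1)   # (B, T, K) squared distances
            q = self.book[name](dist.argmin(-1))               # nearest code per frame
            quantized[name] = zs + (q - zs).detach()           # straight-through estimator
        return quantized
```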
arXiv Detail & Related papers (2024-03-05T16:35:25Z)
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
- Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction [14.661123738628772]
We introduce a text-to-speech (TTS) framework based on a neural transducer.
We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer in the TTS framework while enjoying its monotonic alignment constraints.
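Such discretized semantic tokens are commonly obtained by running k-means over wav2vec 2.0 hidden states; a sketch using Hugging Face transformers and scikit-learn, where the checkpoint name, layer choice, and cluster count are assumptions the abstract does not specify:

```python
# Sketch: turning wav2vec 2.0 features into discrete semantic tokens via
# k-means. Checkpoint, layer, and cluster count are assumptions.
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def features(waveform):
    # waveform: 1-D float array sampled at 16 kHz
    inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0]  # (frames, 768), ~50 frames/s

# Fit a codebook once over features pooled from the training corpus;
# each ~20 ms frame then maps to one semantic token id:
# kmeans = KMeans(n_clusters=512).fit(pooled_training_features)
# tokens = kmeans.predict(features(waveform).numpy())
```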
arXiv Detail & Related papers (2023-11-06T06:13:39Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters [2.6572330982240935]
This paper presents new techniques for a continuous vocoder, in which all features are continuous, yielding a flexible speech synthesis system.
A new continuous noise masking based on phase distortion is proposed to eliminate the perceptual impact of residual noise.
Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters for more natural-sounding, human-like speech.
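Concretely, the acoustic model here can be read as a bidirectional recurrent network regressing frame-aligned linguistic features onto the continuous vocoder parameters; a sketch in which the feature dimensions (e.g. 60 spectral coefficients plus continuous F0 and maximum voiced frequency) are assumptions:

```python
# Sketch of the acoustic model: a bidirectional LSTM mapping linguistic
# features to continuous vocoder parameters. Swap nn.LSTM for nn.GRU to
# compare the two variants studied. Dimensions are assumptions.
import torch.nn as nn

class ContinuousParamModel(nn.Module):
    def __init__(self, d_linguistic=300, d_hidden=256, d_params=62):
        super().__init__()
        self.rnn = nn.LSTM(d_linguistic, d_hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * d_hidden, d_params)  # 2x: forward + backward states

    def forward(self, x):
        # x: (B, T, d_linguistic) frame-aligned linguistic features
        h, _ = self.rnn(x)
        return self.out(h)  # (B, T, d_params): continuous F0, MVF, spectral coeffs
```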
arXiv Detail & Related papers (2021-06-19T12:05:01Z)
- Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z)
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on word-level prosody features.
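That conditioning can be sketched as predicting a word-level prosody vector first and broadcasting it down to each word's phonemes before the phoneme-level prediction; names and shapes below are assumptions for illustration:

```python
# Sketch of the hierarchy: phoneme-level prosody is predicted conditioned on
# the word-level prosody of the containing word. Shapes/names are assumptions.
import torch
import torch.nn as nn

class HierarchicalProsody(nn.Module):
    def __init__(self, d_text=256, d_pros=4):
        super().__init__()
        self.word_pred = nn.Linear(d_text, d_pros)            # word-level prosody (e.g. mean F0/energy)
        self.phone_pred = nn.Linear(d_text + d_pros, d_pros)  # phoneme-level, conditioned on word-level

    def forward(self, h_phone, word_of_phone, h_word):
        # h_phone: (B, Tp, d_text), h_word: (B, Tw, d_text)
        # word_of_phone: (B, Tp) long tensor, index of the word containing each phoneme
        p_word = self.word_pred(h_word)                       # (B, Tw, d_pros)
        idx = word_of_phone.unsqueeze(-1).expand(-1, -1, p_word.size(-1))
        p_word_exp = torch.gather(p_word, 1, idx)             # broadcast word prosody to phonemes
        return self.phone_pred(torch.cat([h_phone, p_word_exp], -1))  # (B, Tp, d_pros)
```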
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
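A normalizing flow of this kind is typically a stack of invertible coupling steps conditioned on the decoder state; the single affine-coupling step below is a generic sketch of such a step, not Wave-Tacotron's exact flow (block size and network width are assumptions):

```python
# One affine-coupling step: half the waveform block is transformed with a
# scale/shift predicted from the other half plus the decoder conditioning.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, block=960, d_cond=256):
        super().__init__()
        half = block // 2
        self.net = nn.Sequential(nn.Linear(half + d_cond, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * half))

    def forward(self, x, cond):
        # x: (B, block) waveform block, cond: (B, d_cond) decoder state
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(torch.cat([xa, cond], -1)).chunk(2, dim=-1)
        yb = xb * torch.exp(log_s) + t                  # invertible given xa and cond
        return torch.cat([xa, yb], -1), log_s.sum(-1)   # output and log|det J| for the NLL loss

    def inverse(self, y, cond):
        # Sampling direction: map noise back through the coupling step.
        ya, yb = y.chunk(2, dim=-1)
        log_s, t = self.net(torch.cat([ya, cond], -1)).chunk(2, dim=-1)
        return torch.cat([ya, (yb - t) * torch.exp(-log_s)], -1)
```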
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
arXiv Detail & Related papers (2020-05-12T02:45:09Z)