Eigenresiduals for improved Parametric Speech Synthesis
- URL: http://arxiv.org/abs/2001.00581v1
- Date: Thu, 2 Jan 2020 09:39:07 GMT
- Title: Eigenresiduals for improved Parametric Speech Synthesis
- Authors: Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit
- Abstract summary: A new excitation model is proposed to produce natural-sounding voices in a speech synthesizer.
The model is based on the decomposition of pitch-synchronous residual frames on an orthonormal basis.
A stream of PCA-based coefficients is added to our HMM-based synthesizer, allowing the voiced excitation to be generated during synthesis.
- Score: 11.481208551940998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Statistical parametric speech synthesizers have recently shown their ability
to produce natural-sounding and flexible voices. Unfortunately, the delivered
quality suffers from a typical buzziness caused by the vocoding of speech. This
paper proposes a new excitation model to reduce this
undesirable effect. This model is based on the decomposition of
pitch-synchronous residual frames on an orthonormal basis obtained by Principal
Component Analysis. This basis contains a limited number of eigenresiduals and
is computed on a relatively small speech database. A stream of PCA-based
coefficients is added to our HMM-based synthesizer, allowing the voiced
excitation to be generated during synthesis. An improvement over the
traditional excitation is reported, while the synthesis engine footprint
remains below about 1 MB.
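The PCA decomposition described in the abstract can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function names and the assumption that residual frames are already pitch-synchronous, aligned, and length-normalized are mine.

```python
import numpy as np

def eigenresiduals(residual_frames, n_components=20):
    """Compute an orthonormal eigenresidual basis via PCA.

    residual_frames: (n_frames, frame_len) array of pitch-synchronous
    residual frames, assumed already aligned and length-normalized.
    Returns (mean_frame, basis), where the rows of `basis` are the
    eigenresiduals (principal directions of the residual frames).
    """
    X = np.asarray(residual_frames, dtype=float)
    mean = X.mean(axis=0)
    # PCA via SVD of the centered data matrix; rows of Vt are orthonormal
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def encode(frame, mean, basis):
    # Project a residual frame onto the eigenresidual basis
    return basis @ (frame - mean)

def decode(coeffs, mean, basis):
    # Reconstruct a voiced-excitation frame from its PCA coefficients
    return mean + basis.T @ coeffs
```

At synthesis time, only the small basis and the per-frame coefficient stream are needed, which is consistent with the compact footprint the paper reports; keeping `n_components` low trades reconstruction accuracy for size.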
Related papers
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation that simultaneously achieves fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z) - Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System [23.96111084078404]
This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system.
We show that the proposed system improves speech quality over a baseline system while maintaining controllability.
arXiv Detail & Related papers (2022-11-21T07:35:21Z) - NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z) - Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
Once the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
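The discretize-then-resynthesize idea can be illustrated with a toy vector-quantization round trip. This is only a sketch under strong simplifications: the paper uses learned discrete units and a neural synthesizer, whereas the k-means codebook and lookup-based reconstruction below are mine.

```python
import numpy as np

def build_codebook(frames, n_symbols=64, n_iter=20):
    """Toy k-means codebook over (n_frames, dim) feature frames."""
    # Deterministic init: frames spread evenly across the data
    idx = np.linspace(0, len(frames) - 1, n_symbols).astype(int)
    codebook = frames[idx].astype(float).copy()
    for _ in range(n_iter):
        # Assign each frame to its nearest code vector
        d = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each code vector to the mean of its assigned frames
        for k in range(n_symbols):
            if np.any(labels == k):
                codebook[k] = frames[labels == k].mean(axis=0)
    return codebook

def discretize(frames, codebook):
    # Map each frame to the index of its nearest code vector
    d = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
    return d.argmin(axis=1)

def resynthesize(symbols, codebook):
    # Reconstruct frames purely from the discrete symbol sequence
    return codebook[symbols]
```

The point of the round trip is that everything downstream of `discretize` sees only symbol indices, so noise and interfering sources not captured by the codebook are discarded rather than reproduced.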
arXiv Detail & Related papers (2021-12-17T08:35:40Z) - FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis [6.509758931804479]
We propose a feed-forward Transformer based TTS model that is designed based on the source-filter theory.
FastPitchFormant has a unique structure that handles text and acoustic features in parallel.
arXiv Detail & Related papers (2021-06-29T07:06:42Z) - Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters [2.6572330982240935]
This paper presents new techniques for a continuous vocoder, in which all features are continuous, yielding a flexible speech synthesis system.
New continuous noise masking based on the phase distortion is proposed to eliminate the perceptual impact of the residual noise.
Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters for more natural, human-like speech.
arXiv Detail & Related papers (2021-06-19T12:05:01Z) - Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss [49.62291237343537]
We propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network.
With a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models.
arXiv Detail & Related papers (2020-10-22T20:14:59Z) - VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z) - Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z) - A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis [11.481208551940998]
We propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual signal.
The proposed residual model is integrated within a HMM-based speech synthesizer.
Results show a significant improvement for both male and female voices.
arXiv Detail & Related papers (2019-12-29T07:26:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.