Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural
Speech Synthesis System
- URL: http://arxiv.org/abs/2211.11222v1
- Date: Mon, 21 Nov 2022 07:35:21 GMT
- Title: Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural
Speech Synthesis System
- Authors: Takenori Yoshimura, Shinji Takaki, Kazuhiro Nakamura, Keiichiro Oura,
Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda
- Abstract summary: This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system.
We show that the proposed system improves speech quality over a baseline system while maintaining controllability.
- Score: 23.96111084078404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper integrates a classic mel-cepstral synthesis filter into a modern
neural speech synthesis system towards end-to-end controllable speech
synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in
neural waveform models in the proposed system, both voice characteristics and
the pitch of synthesized speech are highly controlled via a frequency warping
parameter and fundamental frequency, respectively. We implement the
mel-cepstral synthesis filter as a differentiable and GPU-friendly module to
enable the acoustic and waveform models in the proposed system to be
simultaneously optimized in an end-to-end manner. Experiments show that the
proposed system improves speech quality over a baseline system while
maintaining controllability. The core PyTorch modules used in the experiments will be
publicly available on GitHub.
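To make the idea concrete, the sketch below shows one way such a filter could be implemented as a differentiable, GPU-friendly PyTorch module: the mel-cepstrum is mapped to a per-frame impulse response through the frequency warping induced by the all-pass transform, and the excitation is filtered frame by frame via FFT convolution. This is a minimal illustration under assumed shapes and names, not the authors' released modules; it ignores inter-frame overlap and filter tails. Here `alpha` is the frequency-warping parameter that controls voice characteristics, and pitch is controlled through the fundamental frequency of the excitation.
```python
# Hypothetical sketch (NOT the authors' released code) of a differentiable
# mel-cepstral synthesis filter. The transfer function
# H(z) = exp(sum_m mc[m] * z~^-m), with z~^-1 = (z^-1 - a) / (1 - a z^-1),
# is evaluated on an FFT grid and inverted to a per-frame impulse response.
import torch


def mel_cepstrum_to_impulse_response(mc, alpha=0.42, fft_size=512, ir_length=256):
    """mc: (B, T, M+1) mel-cepstra -> (B, T, ir_length) impulse responses."""
    half = fft_size // 2 + 1
    omega = torch.linspace(0.0, torch.pi, half, device=mc.device)
    # Phase response of the all-pass transform: the warped frequency axis.
    warped = omega + 2.0 * torch.atan2(alpha * torch.sin(omega),
                                       1.0 - alpha * torch.cos(omega))
    m = torch.arange(mc.size(-1), device=mc.device)
    basis = torch.exp(-1j * warped[:, None] * m[None, :])         # (half, M+1)
    log_h = torch.einsum('btm,fm->btf', mc.to(basis.dtype), basis)
    return torch.fft.irfft(torch.exp(log_h), n=fft_size)[..., :ir_length]


def synthesize(excitation, mc, frame_shift=80):
    """Filter a (B, T * frame_shift) excitation with per-frame responses."""
    h = mel_cepstrum_to_impulse_response(mc)                      # (B, T, L)
    frames = excitation.unfold(-1, frame_shift, frame_shift)     # (B, T, shift)
    n = frames.size(-1) + h.size(-1) - 1
    # FFT convolution per frame; the filter tail past each frame is dropped.
    y = torch.fft.irfft(torch.fft.rfft(frames, n=n) * torch.fft.rfft(h, n=n), n=n)
    return y[..., :frame_shift].reshape(excitation.shape)


# Usage with invented shapes: gradients flow back into the mel-cepstra, so an
# upstream acoustic model can be trained end-to-end through the filter.
exc = torch.randn(2, 100 * 80)                     # pulse/noise excitation
mc = 0.1 * torch.randn(2, 100, 25, requires_grad=True)
wav = synthesize(exc, mc)
wav.sum().backward()                               # mc.grad is populated
```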
Related papers
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS, which retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols into the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Integrated Speech and Gesture Synthesis [26.267738299876314]
Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities.
We propose to synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG).
The model achieves this with faster synthesis and a greatly reduced parameter count compared to the pipeline system.
arXiv Detail & Related papers (2021-08-25T19:04:00Z)
- FastPitchFormant: Source-filter based Decomposed Modeling for Speech
Synthesis [6.509758931804479]
We propose a feed-forward Transformer-based TTS model designed according to the source-filter theory.
FastPitchFormant has a unique structure that handles text and acoustic features in parallel.
arXiv Detail & Related papers (2021-06-29T07:06:42Z)
- Pretraining Strategies, Waveform Model Choice, and Acoustic
Configurations for Multi-Speaker End-to-End Speech Synthesis [47.30453049606897]
We find that fine-tuning a multi-speaker model from found audiobook data can improve naturalness and similarity to unseen target speakers of synthetic speech.
We also find that listeners can discern between 16 kHz and 24 kHz sampling rates, and that WaveRNN produces output waveforms of a quality comparable to WaveNet's.
arXiv Detail & Related papers (2020-11-10T00:19:04Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
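For readers unfamiliar with the building block, a minimal affine coupling layer, the standard invertible step composed to form a normalizing flow, is sketched below; it is a generic illustration with invented sizes, not Wave-Tacotron's actual decoder flow.
```python
# Generic affine coupling layer (a sketch, not Wave-Tacotron's architecture).
# Composing several such invertible steps yields a normalizing flow that is
# trained by maximum likelihood and inverted exactly for sampling.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=128):          # dim must be even
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))   # -> scale and shift

    def forward(self, x):                         # data -> latent (training)
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        z = torch.cat([xa, xb * torch.exp(log_s) + t], dim=-1)
        return z, log_s.sum(dim=-1)               # log|det Jacobian| for the loss

    def inverse(self, z):                         # latent -> data (sampling)
        za, zb = z.chunk(2, dim=-1)
        log_s, t = self.net(za).chunk(2, dim=-1)
        return torch.cat([za, (zb - t) * torch.exp(-log_s)], dim=-1)
```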
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Neural Granular Sound Synthesis [53.828476137089325]
Granular sound synthesis is a popular audio generation technique based on rearranging sequences of small waveform windows.
We show that generative neural networks can implement granular synthesis while alleviating most of its shortcomings.
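As a rough reminder of the classical technique being neuralized here, the toy Python sketch below (invented parameters, not the paper's model) overlap-adds short windowed grains drawn from a source waveform:
```python
# Toy classical granular synthesis: draw random Hann-windowed grains from a
# source signal x and overlap-add them at a fixed hop (invented parameters).
import numpy as np


def granular_resynthesis(x, grain=1024, hop=256, n_grains=500, seed=0):
    rng = np.random.default_rng(seed)
    window = np.hanning(grain)                     # tapers each grain's edges
    out = np.zeros(n_grains * hop + grain)
    for i in range(n_grains):
        start = rng.integers(0, len(x) - grain)    # assumes len(x) > grain
        out[i * hop:i * hop + grain] += x[start:start + grain] * window
    return out / np.max(np.abs(out))               # normalize to [-1, 1]
```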
arXiv Detail & Related papers (2020-08-04T08:08:00Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
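As an illustration of the building block, a minimal pitch-conditioned CVAE in PyTorch might look like the sketch below; the layer sizes and the single pitch-conditioning input are assumptions for illustration, not VaPar Synth's actual architecture.
```python
# Minimal pitch-conditioned CVAE sketch (hypothetical sizes, not VaPar Synth).
# Conditioning the decoder on pitch is what allows pitch to be varied freely
# at generation time.
import torch
import torch.nn as nn


class CVAE(nn.Module):
    def __init__(self, feat_dim=64, cond_dim=1, latent_dim=16, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim + cond_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim + cond_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, feat_dim))

    def forward(self, x, pitch):
        h = self.enc(torch.cat([x, pitch], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(torch.cat([z, pitch], dim=-1)), mu, logvar


# Training would minimize reconstruction error plus the KL divergence term.
```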
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
- Eigenresiduals for improved Parametric Speech Synthesis [11.481208551940998]
A new excitation model is proposed to produce natural-sounding voices in a speech synthesizer.
The model is based on the decomposition of pitch-synchronous residual frames on an orthonormal basis.
A stream of PCA-based coefficients is added to our HMM-based synthesizer, making it possible to generate the voiced excitation during synthesis.
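A compact sketch of that decomposition follows (names are mine, not the paper's code): collect equal-length pitch-synchronous residual frames and take the leading principal components as the orthonormal excitation basis.
```python
# Sketch of the eigenresidual idea (hypothetical names, not the paper's code):
# PCA over pitch-synchronous residual frames yields an orthonormal basis, and
# each voiced excitation frame is represented by its projection coefficients.
import numpy as np


def eigenresiduals(frames, n_components=20):
    """frames: (N, L) length-normalized pitch-synchronous residuals."""
    mean = frames.mean(axis=0)
    X = frames - mean                              # center before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:n_components]                      # (K, L) orthonormal rows
    coeffs = X @ basis.T                           # (N, K) per-frame weights
    return mean, basis, coeffs

# Resynthesis of frame i during synthesis: mean + coeffs[i] @ basis
```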
arXiv Detail & Related papers (2020-01-02T09:39:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.