Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural
Speech Synthesis System
- URL: http://arxiv.org/abs/2211.11222v1
- Date: Mon, 21 Nov 2022 07:35:21 GMT
- Title: Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural
Speech Synthesis System
- Authors: Takenori Yoshimura, Shinji Takaki, Kazuhiro Nakamura, Keiichiro Oura,
Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda
- Abstract summary: This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system.
We show that the proposed system improves speech quality over a baseline system while maintaining controllability.
- Score: 23.96111084078404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper integrates a classic mel-cepstral synthesis filter into a modern
neural speech synthesis system towards end-to-end controllable speech
synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in
neural waveform models in the proposed system, both voice characteristics and
the pitch of synthesized speech are highly controlled via a frequency warping
parameter and fundamental frequency, respectively. We implement the
mel-cepstral synthesis filter as a differentiable and GPU-friendly module to
enable the acoustic and waveform models in the proposed system to be
simultaneously optimized in an end-to-end manner. Experiments show that the
proposed system improves speech quality over a baseline system while
maintaining controllability. The core PyTorch modules used in the experiments will be
publicly available on GitHub.
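To make the idea concrete, the sketch below shows one way such a filter could be implemented as a differentiable, GPU-friendly PyTorch module: the mel-cepstrum is mapped to a per-frame impulse response through the frequency warping induced by the all-pass transform, and the excitation is filtered frame by frame via FFT convolution. This is a minimal illustration under assumed shapes and names, not the authors' released modules; it ignores inter-frame overlap and filter tails. Here `alpha` is the frequency-warping parameter that controls voice characteristics, and pitch is controlled through the fundamental frequency of the excitation.
```python
# Hypothetical sketch (NOT the authors' released code) of a differentiable
# mel-cepstral synthesis filter. The transfer function
# H(z) = exp(sum_m mc[m] * z~^-m), with z~^-1 = (z^-1 - a) / (1 - a z^-1),
# is evaluated on an FFT grid and inverted to a per-frame impulse response.
import torch


def mel_cepstrum_to_impulse_response(mc, alpha=0.42, fft_size=512, ir_length=256):
    """mc: (B, T, M+1) mel-cepstra -> (B, T, ir_length) impulse responses."""
    half = fft_size // 2 + 1
    omega = torch.linspace(0.0, torch.pi, half, device=mc.device)
    # Phase response of the all-pass transform: the warped frequency axis.
    warped = omega + 2.0 * torch.atan2(alpha * torch.sin(omega),
                                       1.0 - alpha * torch.cos(omega))
    m = torch.arange(mc.size(-1), device=mc.device)
    basis = torch.exp(-1j * warped[:, None] * m[None, :])         # (half, M+1)
    log_h = torch.einsum('btm,fm->btf', mc.to(basis.dtype), basis)
    return torch.fft.irfft(torch.exp(log_h), n=fft_size)[..., :ir_length]


def synthesize(excitation, mc, frame_shift=80):
    """Filter a (B, T * frame_shift) excitation with per-frame responses."""
    h = mel_cepstrum_to_impulse_response(mc)                      # (B, T, L)
    frames = excitation.unfold(-1, frame_shift, frame_shift)     # (B, T, shift)
    n = frames.size(-1) + h.size(-1) - 1
    # FFT convolution per frame; the filter tail past each frame is dropped.
    y = torch.fft.irfft(torch.fft.rfft(frames, n=n) * torch.fft.rfft(h, n=n), n=n)
    return y[..., :frame_shift].reshape(excitation.shape)


# Usage with invented shapes: gradients flow back into the mel-cepstra, so an
# upstream acoustic model can be trained end-to-end through the filter.
exc = torch.randn(2, 100 * 80)                     # pulse/noise excitation
mc = 0.1 * torch.randn(2, 100, 25, requires_grad=True)
wav = synthesize(exc, mc)
wav.sum().backward()                               # mc.grad is populated
```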
Related papers
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS, which retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols into the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Integrated Speech and Gesture Synthesis [26.267738299876314]
Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities.
We propose to synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG).
The model achieves this with faster synthesis and a greatly reduced parameter count compared to the pipeline system.
arXiv Detail & Related papers (2021-08-25T19:04:00Z)
- FastPitchFormant: Source-filter based Decomposed Modeling for Speech
Synthesis [6.509758931804479]
We propose a feed-forward Transformer-based TTS model designed according to the source-filter theory.
FastPitchFormant has a unique structure that handles text and acoustic features in parallel.
arXiv Detail & Related papers (2021-06-29T07:06:42Z)
- Pretraining Strategies, Waveform Model Choice, and Acoustic
Configurations for Multi-Speaker End-to-End Speech Synthesis [47.30453049606897]
We find that fine-tuning a multi-speaker model from found audiobook data can improve naturalness and similarity to unseen target speakers of synthetic speech.
We also find that listeners can discern between 16 kHz and 24 kHz sampling rates, and that WaveRNN produces output waveforms of a quality comparable to WaveNet's.
arXiv Detail & Related papers (2020-11-10T00:19:04Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
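For readers unfamiliar with the building block, a minimal affine coupling layer, the standard invertible step composed to form a normalizing flow, is sketched below; it is a generic illustration with invented sizes, not Wave-Tacotron's actual decoder flow.
```python
# Generic affine coupling layer (a sketch, not Wave-Tacotron's architecture).
# Composing several such invertible steps yields a normalizing flow that is
# trained by maximum likelihood and inverted exactly for sampling.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=128):          # dim must be even
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))   # -> scale and shift

    def forward(self, x):                         # data -> latent (training)
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        z = torch.cat([xa, xb * torch.exp(log_s) + t], dim=-1)
        return z, log_s.sum(dim=-1)               # log|det Jacobian| for the loss

    def inverse(self, z):                         # latent -> data (sampling)
        za, zb = z.chunk(2, dim=-1)
        log_s, t = self.net(za).chunk(2, dim=-1)
        return torch.cat([za, (zb - t) * torch.exp(-log_s)], dim=-1)
```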
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Neural Granular Sound Synthesis [53.828476137089325]
Granular sound synthesis is a popular audio generation technique based on rearranging sequences of small waveform windows.
We show that generative neural networks can implement granular synthesis while alleviating most of its shortcomings.
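As a rough reminder of the classical technique being neuralized here, the toy Python sketch below (invented parameters, not the paper's model) overlap-adds short windowed grains drawn from a source waveform:
```python
# Toy classical granular synthesis: draw random Hann-windowed grains from a
# source signal x and overlap-add them at a fixed hop (invented parameters).
import numpy as np


def granular_resynthesis(x, grain=1024, hop=256, n_grains=500, seed=0):
    rng = np.random.default_rng(seed)
    window = np.hanning(grain)                     # tapers each grain's edges
    out = np.zeros(n_grains * hop + grain)
    for i in range(n_grains):
        start = rng.integers(0, len(x) - grain)    # assumes len(x) > grain
        out[i * hop:i * hop + grain] += x[start:start + grain] * window
    return out / np.max(np.abs(out))               # normalize to [-1, 1]
```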
arXiv Detail & Related papers (2020-08-04T08:08:00Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
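As an illustration of the building block, a minimal pitch-conditioned CVAE in PyTorch might look like the sketch below; the layer sizes and the single pitch-conditioning input are assumptions for illustration, not VaPar Synth's actual architecture.
```python
# Minimal pitch-conditioned CVAE sketch (hypothetical sizes, not VaPar Synth).
# Conditioning the decoder on pitch is what allows pitch to be varied freely
# at generation time.
import torch
import torch.nn as nn


class CVAE(nn.Module):
    def __init__(self, feat_dim=64, cond_dim=1, latent_dim=16, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim + cond_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim + cond_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, feat_dim))

    def forward(self, x, pitch):
        h = self.enc(torch.cat([x, pitch], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(torch.cat([z, pitch], dim=-1)), mu, logvar


# Training would minimize reconstruction error plus the KL divergence term.
```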
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
- Eigenresiduals for improved Parametric Speech Synthesis [11.481208551940998]
A new excitation model is proposed to produce natural-sounding voices in a speech synthesizer.
The model is based on the decomposition of pitch-synchronous residual frames on an orthonormal basis.
A stream of PCA-based coefficients is added to our HMM-based synthesizer, making it possible to generate the voiced excitation during synthesis.
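A compact sketch of that decomposition follows (names are mine, not the paper's code): collect equal-length pitch-synchronous residual frames and take the leading principal components as the orthonormal excitation basis.
```python
# Sketch of the eigenresidual idea (hypothetical names, not the paper's code):
# PCA over pitch-synchronous residual frames yields an orthonormal basis, and
# each voiced excitation frame is represented by its projection coefficients.
import numpy as np


def eigenresiduals(frames, n_components=20):
    """frames: (N, L) length-normalized pitch-synchronous residuals."""
    mean = frames.mean(axis=0)
    X = frames - mean                              # center before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:n_components]                      # (K, L) orthonormal rows
    coeffs = X @ basis.T                           # (N, K) per-frame weights
    return mean, basis, coeffs

# Resynthesis of frame i during synthesis: mean + coeffs[i] @ basis
```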
arXiv Detail & Related papers (2020-01-02T09:39:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.