WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss
- URL: http://arxiv.org/abs/2002.00417v3
- Date: Tue, 7 Apr 2020 01:24:14 GMT
- Title: WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss
- Authors: Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li
- Abstract summary: Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input.
We propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, that has two loss functions.
WaveTTS ensures both the quality of the acoustic features and the resulting speech waveform.
- Score: 74.11899135025503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tacotron-based text-to-speech (TTS) systems directly synthesize speech from
text input. Such frameworks typically consist of a feature prediction network
that maps character sequences to frequency-domain acoustic features, followed
by a waveform reconstruction algorithm or a neural vocoder that generates the
time-domain waveform from acoustic features. As the loss function is usually
calculated only on the frequency-domain acoustic features, it does not directly
control the quality of the generated time-domain waveform. To address this
problem, we propose a new training scheme for Tacotron-based TTS, referred to
as WaveTTS, that has two loss functions: 1) a time-domain loss, denoted as the
waveform loss, which measures the distortion between the natural and generated
waveforms; and 2) a frequency-domain loss, which measures the Mel-scale acoustic
feature loss between the natural and generated acoustic features. WaveTTS
ensures both the quality of the acoustic features and the resulting speech
waveform. To the best of our knowledge, this is the first implementation of Tacotron
with joint time-frequency domain loss. Experimental results show that the
proposed framework outperforms the baselines and achieves high-quality
synthesized speech.
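The joint objective can be illustrated with a minimal sketch: a time-domain distance on waveforms combined with a frequency-domain distance on Mel-scale features. This is not the authors' implementation; the loss weights (alpha, beta), the Mel-spectrogram settings, and the use of plain L1 distances are assumptions made only to show the structure of the training criterion.

```python
# Minimal sketch of a joint time-frequency domain loss in the spirit of WaveTTS.
# NOTE: not the authors' implementation; alpha/beta weights, mel settings, and
# L1 distances are illustrative assumptions.
import torch
import torch.nn.functional as F
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

def joint_tf_loss(generated_wav: torch.Tensor,
                  natural_wav: torch.Tensor,
                  alpha: float = 1.0,
                  beta: float = 1.0) -> torch.Tensor:
    """Combine a time-domain waveform loss with a frequency-domain mel loss."""
    # 1) Time-domain (waveform) loss: distortion between natural and generated waveforms.
    time_loss = F.l1_loss(generated_wav, natural_wav)

    # 2) Frequency-domain loss: distance between Mel-scale acoustic features.
    gen_mel = mel_transform(generated_wav)
    nat_mel = mel_transform(natural_wav)
    freq_loss = F.l1_loss(gen_mel, nat_mel)

    return alpha * time_loss + beta * freq_loss

# Example usage with dummy one-second waveforms at 22.05 kHz.
if __name__ == "__main__":
    gen = torch.randn(1, 22050)
    nat = torch.randn(1, 22050)
    print(joint_tf_loss(gen, nat))
```

In a Tacotron-style setup, the generated waveform would come from a differentiable waveform reconstruction step (e.g., Griffin-Lim or a neural vocoder) applied to the predicted acoustic features, so that the time-domain term can backpropagate into the feature prediction network.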
Related papers
- PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation [37.35829410807451]
We propose PeriodWave, a novel universal waveform generation model.
We introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal.
We also propose a single period-conditional universal estimator that can feed forward in parallel by period-wise batch inference.
arXiv Detail & Related papers (2024-08-14T13:36:17Z)
- Xi-Net: Transformer Based Seismic Waveform Reconstructor [44.99833362998488]
Gaps in seismic waveforms hamper further signal processing aimed at extracting valuable information.
We present a transformer-based deep learning model, Xi-Net, which utilizes multi-faceted time and frequency domain inputs for accurate waveform reconstruction.
To the best of our knowledge, this is the first transformer-based deep learning model for seismic waveform reconstruction.
arXiv Detail & Related papers (2024-06-14T22:34:13Z)
- WFTNet: Exploiting Global and Local Periodicity in Long-term Time Series Forecasting [61.64303388738395]
We propose a Wavelet-Fourier Transform Network (WFTNet) for long-term time series forecasting.
Tests on various time series datasets show WFTNet consistently outperforms other state-of-the-art baselines.
arXiv Detail & Related papers (2023-09-20T13:44:18Z)
- Wave simulation in non-smooth media by PINN with quadratic neural network and PML condition [2.7651063843287718]
The recently proposed physics-informed neural network (PINN) has achieved successful applications in solving a wide range of partial differential equations (PDEs).
In this paper, we solve the acoustic and visco-acoustic scattered-field wave equation in the frequency domain with PINN, instead of the wave equation, to remove the source perturbation.
We show that the PML condition and quadratic neurons improve the results as well as the attenuation, and we discuss the reasons for this improvement.
arXiv Detail & Related papers (2022-08-16T13:29:01Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete "sound-object" to be detected.
A dense sound proposal event map is then constructed to handle the challenges of predicting events with large varying temporal duration.
arXiv Detail & Related papers (2021-06-13T11:43:41Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of other biologically plausible neuromorphic approaches to sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)