R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS
- URL: http://arxiv.org/abs/2206.15276v1
- Date: Thu, 30 Jun 2022 13:29:31 GMT
- Title: R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS
- Authors: Kyle Kastner, Aaron Courville
- Abstract summary: This paper introduces R-MelNet, a two-part autoregressive architecture with a backend WaveRNN-style audio decoder.
The model produces low-resolution mel-spectral features which are used by a WaveRNN decoder to produce an audio waveform.
- Score: 1.8927791081850118
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper introduces R-MelNet, a two-part autoregressive architecture with a
frontend based on the first tier of MelNet and a backend WaveRNN-style audio
decoder for neural text-to-speech synthesis. Taking as input a mixed sequence
of characters and phonemes, with an optional audio priming sequence, this model
produces low-resolution mel-spectral features which are interpolated and used
by a WaveRNN decoder to produce an audio waveform. Coupled with half precision
training, R-MelNet uses under 11 gigabytes of GPU memory on a single commodity
GPU (NVIDIA 2080Ti). We detail a number of critical implementation details for
stable half precision training, including an approximate, numerically stable
mixture of logistics attention. Using a stochastic, multi-sample per step
inference scheme, the resulting model generates highly varied audio, while
enabling text and audio based controls to modify output waveforms. Qualitative
and quantitative evaluations of an R-MelNet system trained on a single speaker
TTS dataset demonstrate the effectiveness of our approach.
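A sketch of the paper's main stability trick may help. The block below is a minimal, hypothetical PyTorch rendering of one monotonic mixture-of-logistics attention step, where the weight placed on each text position is the difference of two logistic CDFs; the function name, arguments, and the float32 fallback are illustrative assumptions, and the exact approximation R-MelNet uses for half precision training may differ.

```python
import torch
import torch.nn.functional as F


def mol_attention_weights(mu_prev, delta, log_scale, mix_logits, num_positions):
    """Hypothetical mixture-of-logistics attention step.

    mu_prev:    (B, K) component means from the previous decoder step
    delta:      (B, K) unconstrained forward-step predictions
    log_scale:  (B, K) unconstrained log scales
    mix_logits: (B, K) unconstrained mixture weights
    Returns (B, T) attention weights over text positions and the new means.
    """
    # Monotonic means: attention can only move forward along the text.
    mu = mu_prev + F.softplus(delta)                      # (B, K)
    scale = torch.exp(log_scale).clamp(min=1e-3)          # keep scales away from 0
    w = F.softmax(mix_logits, dim=-1)                     # (B, K)

    j = torch.arange(num_positions, device=mu.device, dtype=mu.dtype)
    j = j.view(1, 1, -1)                                  # (1, 1, T)
    mu_ = mu.unsqueeze(-1)                                # (B, K, 1)
    s_ = scale.unsqueeze(-1)                              # (B, K, 1)

    # Weight on position j is CDF(j + 0.5) - CDF(j - 0.5) of a logistic.
    # Do this part in float32: fp16 sigmoids saturate to exactly 0 or 1,
    # and the subtraction then underflows to zero.
    z_hi = (((j + 0.5) - mu_) / s_).float()
    z_lo = (((j - 0.5) - mu_) / s_).float()
    probs = (torch.sigmoid(z_hi) - torch.sigmoid(z_lo)).clamp(min=0.0)

    alpha = (w.unsqueeze(-1).float() * probs).sum(dim=1)  # (B, T)
    alpha = alpha / alpha.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return alpha.to(mu.dtype), mu
```

On the frontend-to-backend handoff, the abstract only states that the low-resolution mel features are interpolated before the WaveRNN decoder; one plausible realization is a simple linear upsampling along time (e.g. torch.nn.functional.interpolate with mode="linear") applied to (batch, n_mels, frames) tensors before conditioning the vocoder.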
Related papers
- SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model [31.280358048556444]
This paper presents an advanced end-to-end singing voice synthesis (SVS) system based on the source-filter mechanism.
The proposed system also incorporates elements like the fundamental pitch (F0) predictor and waveform generation decoder.
Experiments on the Opencpop dataset demonstrate the efficacy of the proposed model in intonation quality and accuracy.
arXiv Detail & Related papers (2024-10-16T13:18:45Z) - Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models [42.39774323584976]
We propose a deep learning based system for the task of deepfake audio detection.
In particular, the raw input audio is first transformed into various spectrograms.
We leverage the state-of-the-art audio pre-trained models of Whisper, Seamless, Speechbrain, and Pyannote to extract audio embeddings.
arXiv Detail & Related papers (2024-07-01T20:10:43Z) - Masked Audio Generation using a Single Non-Autoregressive Transformer [90.11646612273965]
MAGNeT is a masked generative sequence modeling method that operates directly over several streams of audio tokens.
We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation.
We shed light on the importance of each of the components comprising MAGNeT and point to the trade-offs between autoregressive and non-autoregressive modeling.
arXiv Detail & Related papers (2024-01-09T14:29:39Z) - Adaptive re-calibration of channel-wise features for Adversarial Audio Classification [0.0]
We propose a recalibration of features using attention feature fusion for synthetic speech detection.
We compare its performance against different detection methods including End2End models and Resnet-based models.
We also demonstrate that combining linear frequency cepstral coefficients (LFCC) and mel-frequency cepstral coefficients (MFCC) via the attentional feature fusion technique creates better input feature representations.
arXiv Detail & Related papers (2022-10-21T04:21:56Z) - Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating point operations of off-the-shelf audio neural networks by more than half; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2022-10-03T14:00:41Z) - NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis efficiency is also 28% higher than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z) - A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system that utilizes visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z) - WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z) - High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling [38.828260316517536]
This paper presents a novel universal neural vocoder framework based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling (MWDLP).
Experiments demonstrate that the proposed MWDLP framework generates high-fidelity synthetic speech for seen and unseen speakers and/or languages when trained on data from 300 speakers, including clean and noisy/reverberant conditions.
arXiv Detail & Related papers (2021-05-20T16:02:45Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z) - Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
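As noted in the SimPFs entry above, a non-parametric pooling front-end can cut the compute of an off-the-shelf audio classifier roughly in half. The following is a small illustrative sketch of that idea rather than the authors' implementation; the class name, pooling factor, and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn


class SimPFStylePooler(nn.Module):
    """Hypothetical SimPF-style front-end: non-parametric temporal pooling
    of a mel spectrogram before an off-the-shelf audio classifier."""

    def __init__(self, pool_factor: int = 2, mode: str = "avg"):
        super().__init__()
        pool_cls = nn.AvgPool1d if mode == "avg" else nn.MaxPool1d
        self.pool = pool_cls(kernel_size=pool_factor, stride=pool_factor)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, n_frames) -> (batch, n_mels, n_frames // pool_factor)
        return self.pool(mel)


# Halving the frame rate roughly halves the FLOPs of any downstream
# network whose cost is linear in sequence length.
frontend = SimPFStylePooler(pool_factor=2)
mel = torch.randn(8, 64, 1000)   # batch of 64-bin mel spectrograms
pooled = frontend(mel)           # -> torch.Size([8, 64, 500])
```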
This list is automatically generated from the titles and abstracts of the papers on this site.