RawNet: Fast End-to-End Neural Vocoder
- URL: http://arxiv.org/abs/1904.05351v2
- Date: Fri, 10 Mar 2023 03:22:22 GMT
- Title: RawNet: Fast End-to-End Neural Vocoder
- Authors: Yunchao He, Yujun Wang
- Abstract summary: RawNet is a complete end-to-end neural vocoder based on an auto-encoder structure for speaker-dependent and speaker-independent speech synthesis.
It automatically learns to extract features and recover audio using neural networks: a coder network that captures a higher-level representation of the input audio, and an autoregressive voder network that restores the audio in a sample-by-sample manner.
- Score: 4.507860128918788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural network-based vocoders have recently demonstrated a powerful ability to synthesize high-quality speech. These models usually generate samples by conditioning on spectral features, such as the Mel-spectrogram and fundamental frequency, which are crucial to speech synthesis. However, the feature extraction process tends to depend heavily on human knowledge, resulting in a less expressive description of the original audio. In this work, we propose RawNet, a complete end-to-end neural vocoder following the auto-encoder structure for speaker-dependent and speaker-independent speech synthesis. It automatically learns to extract features and recover audio using neural networks, which include a coder network to capture a higher-level representation of the input audio and an autoregressive voder network to restore the audio in a sample-by-sample manner. The coder and voder are jointly trained directly on the raw waveform without any human-designed features. Experimental results show that RawNet achieves better speech quality with a simplified model architecture and generates speech faster at the inference stage.
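The coder/voder layout can be pictured with a minimal PyTorch sketch. The layer sizes, strides, and the 8-bit mu-law output head below are illustrative assumptions, not the paper's exact configuration; only the overall structure (convolutional coder, autoregressive voder, joint training on raw audio) follows the abstract.

```python
# Sketch of a RawNet-style coder/voder auto-encoder (dimensions assumed).
import torch
import torch.nn as nn

class Coder(nn.Module):
    """Strided 1-D convolutions that compress the raw waveform into a
    learned latent sequence, with no hand-crafted spectral features."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=16, stride=4, padding=6), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=16, stride=4, padding=6), nn.ReLU(),
            nn.Conv1d(128, latent_dim, kernel_size=16, stride=4, padding=6),
        )

    def forward(self, wav):              # wav: (B, 1, T)
        return self.net(wav)             # (B, latent_dim, T / 64)

class Voder(nn.Module):
    """Autoregressive decoder: conditioned on the upsampled latents, it
    predicts one 8-bit mu-law sample at a time."""
    def __init__(self, latent_dim=128, hidden=256, classes=256):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=64, mode="nearest")
        self.embed = nn.Embedding(classes, 64)   # previous sample -> vector
        self.rnn = nn.GRU(64 + latent_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, classes)

    def forward(self, prev_samples, latents):    # teacher-forced training
        cond = self.upsample(latents).transpose(1, 2)  # (B, T, latent_dim)
        x = torch.cat([self.embed(prev_samples), cond], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                             # (B, T, classes)

# Joint training directly on raw audio: reconstruct the waveform encoded.
coder, voder = Coder(), Voder()
wav = torch.randn(2, 1, 4096)                  # dummy raw-audio batch
targets = torch.randint(0, 256, (2, 4096))     # mu-law encoded waveform
prev = torch.roll(targets, shifts=1, dims=1)   # shifted-by-one input
logits = voder(prev, coder(wav))
loss = nn.functional.cross_entropy(logits.reshape(-1, 256),
                                   targets.reshape(-1))
loss.backward()
```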
Related papers
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
Existing approaches still suffer from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model that leverages neural codec language modeling to improve reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
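The decoder-reuse step amounts to ordinary checkpoint surgery: copy only the pre-trained decoder weights into the new model and train the video encoder from scratch. The module names, key prefix, and placeholder layers below are hypothetical, not taken from the paper.

```python
# Illustrative reuse of a pre-trained audio decoder (names assumed).
import torch
import torch.nn as nn

class VideoToSpeech(nn.Module):
    """Toy stand-in: a video encoder trained from scratch plus an audio
    decoder initialized from audio-only pre-training."""
    def __init__(self):
        super().__init__()
        self.video_encoder = nn.Linear(512, 256)   # placeholder
        self.audio_decoder = nn.Linear(256, 1024)  # placeholder

model = VideoToSpeech()

# Pretend this checkpoint came from encoder-decoder pre-training on
# ~3,500 hours of 24 kHz audio (the key prefix is an assumption).
ckpt = {"decoder.weight": torch.randn(1024, 256),
        "decoder.bias": torch.randn(1024)}

# Keep only the decoder parameters and load them into the new model.
decoder_state = {k.removeprefix("decoder."): v
                 for k, v in ckpt.items() if k.startswith("decoder.")}
model.audio_decoder.load_state_dict(decoder_state)
```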
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features perform comparably to modern self-supervised speech representation learning models on any-to-any voice conversion.
arXiv Detail & Related papers (2022-11-22T02:50:12Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder, with the help of the VQ-VAE, to map the text features extracted by the text encoder to a mel-spectrogram; the vocoder then transforms the generated mel-spectrogram into a waveform, as sketched below.
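The stage ordering of that pipeline can be written down directly. The function names below are hypothetical; each argument stands for a separately trained network from the summary.

```python
# Stage ordering of a Diffsound-style text-to-sound pipeline (names assumed).
def text_to_sound(text, text_encoder, diffusion_decoder, vq_vae, vocoder):
    text_feats = text_encoder(text)        # text -> text features
    # The decoder predicts discrete VQ-VAE codes from the text features
    # via a discrete diffusion process; the VQ-VAE turns the codes into
    # a mel-spectrogram.
    tokens = diffusion_decoder(text_feats)
    mel = vq_vae.decode(tokens)
    return vocoder(mel)                    # mel-spectrogram -> waveform
```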
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS, which retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN when synthesizing on a single core.
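The noise controllability follows from the deterministic-plus-stochastic decomposition named in the title: if voiced and noise-like components are predicted separately, a scale on the stochastic part becomes a user control at inference. The toy below is a simplified reading of that idea, not the actual NeuralDPS networks or its multiband excitation.

```python
# Simplified deterministic + stochastic synthesis with a noise control knob.
import numpy as np

def synthesize(deterministic, stochastic, noise_scale=1.0):
    """deterministic: periodic component predicted by one network path;
    stochastic: noise-like component from another; noise_scale lets the
    user attenuate or boost the noise at inference time."""
    return deterministic + noise_scale * stochastic

t = np.linspace(0, 1, 16000)
voiced = 0.5 * np.sin(2 * np.pi * 120 * t)    # stand-in periodic part
noise = 0.05 * np.random.randn(t.size)        # stand-in stochastic part
cleaner = synthesize(voiced, noise, noise_scale=0.2)  # suppressed noise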
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating the parameters defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and to generalize from speech to singing.
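At the shape level, such an analyzer is a framing front-end with one head per vocoder parameter. The frame size and encoding dimensions below are illustrative assumptions, not DeepA's configuration.

```python
# Shape-level sketch of a learnable analyzer emitting vocoder-style
# parameters: F0 plus timbre and aperiodicity encodings (sizes assumed).
import torch
import torch.nn as nn

class NeuralAnalyzer(nn.Module):
    def __init__(self, frame=256, hidden=128):
        super().__init__()
        # Framing via a strided convolution over the raw waveform.
        self.frames = nn.Conv1d(1, hidden, kernel_size=frame,
                                stride=frame // 2)
        self.f0_head = nn.Conv1d(hidden, 1, 1)        # one F0 per frame
        self.timbre_head = nn.Conv1d(hidden, 64, 1)   # spectral envelope
        self.aperiodicity_head = nn.Conv1d(hidden, 64, 1)

    def forward(self, wav):                           # wav: (B, 1, T)
        h = torch.relu(self.frames(wav))
        return (self.f0_head(h), self.timbre_head(h),
                self.aperiodicity_head(h))

f0, timbre, aperiodicity = NeuralAnalyzer()(torch.randn(1, 1, 16000))
```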
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- Audio Captioning Transformer [44.68751180694813]
Audio captioning aims to automatically generate a natural language description of an audio clip.
Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder.
We propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free.
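A convolution-free encoder-decoder of this kind can be assembled from a patch embedding and a standard Transformer. The patch size, model width, and vocabulary below are assumptions for illustration, not ACT's actual hyperparameters.

```python
# Minimal convolution-free audio-captioning stack (dimensions assumed).
import torch
import torch.nn as nn

d_model, patch, vocab = 256, 16, 5000
to_patches = nn.Linear(patch * 64, d_model)   # 64 mel bins per frame
transformer = nn.Transformer(d_model=d_model, batch_first=True)
word_embed = nn.Embedding(vocab, d_model)
word_out = nn.Linear(d_model, vocab)

spec = torch.randn(1, 128, 64)                # (batch, frames, mel bins)
# Group frames into patches and embed them linearly (no convolutions).
src = to_patches(spec.reshape(1, 128 // patch, patch * 64))
tgt = word_embed(torch.randint(0, vocab, (1, 12)))  # caption prefix
logits = word_out(transformer(src, tgt))      # next-word distribution
```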
arXiv Detail & Related papers (2021-07-21T00:31:50Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model can reconstruct speech with remarkable realism on constrained datasets such as GRID.
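The adversarial recipe pairs the video-to-waveform generator with a waveform discriminator. The networks below are toy placeholders with assumed input sizes; only the two-loss structure reflects the summary.

```python
# GAN training losses for video-to-speech (toy networks, sizes assumed).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Flatten(), nn.Linear(75 * 48 * 48, 16000))  # video->wav
D = nn.Sequential(nn.Linear(16000, 1))                           # wav->score
bce = nn.BCEWithLogitsLoss()

video = torch.randn(2, 75, 48, 48)   # 75 frames of 48x48 mouth crops
real_wav = torch.randn(2, 16000)
fake_wav = G(video)

# The discriminator learns to separate real from generated speech...
d_loss = (bce(D(real_wav), torch.ones(2, 1)) +
          bce(D(fake_wav.detach()), torch.zeros(2, 1)))
# ...while the generator is rewarded for fooling it.
g_loss = bce(D(fake_wav), torch.ones(2, 1))
```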
arXiv Detail & Related papers (2021-04-27T17:12:30Z)