DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding
- URL: http://arxiv.org/abs/2110.06434v1
- Date: Wed, 13 Oct 2021 01:39:57 GMT
- Title: DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding
- Authors: Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li
- Abstract summary: We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
- Score: 71.73405116189531
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional vocoders are commonly used as analysis tools to provide
interpretable features for downstream tasks such as speech synthesis and voice
conversion. They are built on certain signal-processing assumptions about the
input signals and are therefore not easily generalizable to different audio,
for example, from speech to singing. In this paper, we propose
a deep neural analyzer, denoted DeepA: a neural vocoder that extracts F0
and timbre/aperiodicity encodings from the input speech that emulate those
defined in conventional vocoders. Therefore, the resulting parameters are more
interpretable than other latent neural representations. At the same time, as
the deep neural analyzer is learnable, it is expected to be more accurate for
signal reconstruction and manipulation, and generalizable from speech to
singing. The proposed neural analyzer is built based on a variational
autoencoder (VAE) architecture. We show that DeepA improves F0 estimation over
the conventional vocoder (WORLD). To the best of our knowledge, this is the
first study dedicated to the development of a neural framework for extracting
learnable vocoder-like parameters.
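As an illustration of the kind of signal-processing assumption the abstract refers to, the following minimal sketch estimates F0 for a single frame by picking the autocorrelation peak, which assumes the frame is periodic. It is not WORLD's F0 algorithm and not DeepA's learned analyzer; the function name `estimate_f0_autocorr`, the 50-500 Hz search range, and the synthetic 200 Hz test tone are all assumptions made for this example.

```python
import math

def estimate_f0_autocorr(frame, sample_rate, f0_min=50.0, f0_max=500.0):
    """Estimate F0 (Hz) of one frame from its autocorrelation peak.

    Hypothetical helper for illustration only: it assumes the frame is
    periodic, which is the sort of assumption a learned analyzer avoids.
    """
    lag_min = int(sample_rate / f0_max)                       # shortest period searched
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)  # longest period searched
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # Return 0.0 when no positive peak is found (e.g. silence).
    return sample_rate / best_lag if best_lag else 0.0

# Synthetic voiced frame: a 200 Hz sine sampled at 16 kHz (40 ms).
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(640)]
f0 = estimate_f0_autocorr(tone, sample_rate)
silence_f0 = estimate_f0_autocorr([0.0] * 640, sample_rate)
```

A hand-crafted rule like this works on clean, strictly periodic input but has no way to adapt to signals that break its assumptions, which is the limitation the abstract attributes to conventional vocoders.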
Related papers
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z)
- BrainBERT: Self-supervised representation learning for intracranial recordings [18.52962864519609]
We create a reusable Transformer, BrainBERT, for intracranial recordings bringing modern representation learning approaches to neuroscience.
Much like in NLP and speech recognition, this Transformer enables classifying complex concepts, with higher accuracy and with much less data.
In the future, far more concepts will be decodable from neural recordings by using representation learning, potentially unlocking the brain like language models unlocked language.
arXiv Detail & Related papers (2023-02-28T07:40:37Z)
- Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models.
arXiv Detail & Related papers (2022-11-22T02:50:12Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder-based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Deep learning approaches for neural decoding: from CNNs to LSTMs and spikes to fMRI [2.0178765779788495]
Decoding behavior, perception, or cognitive state directly from neural signals has applications in brain-computer interface research.
In the last decade, deep learning has become the state-of-the-art method in many machine learning tasks.
Deep learning has been shown to be a useful tool for improving the accuracy and flexibility of neural decoding across a wide range of tasks.
arXiv Detail & Related papers (2020-05-19T18:10:35Z)
- RawNet: Fast End-to-End Neural Vocoder [4.507860128918788]
RawNet is a complete end-to-end neural vocoder based on the auto-encoder structure for speaker-dependent and -independent speech synthesis.
It automatically learns to extract features and recover audio using neural networks, which include a coder network to capture a higher representation of the input audio and an autoregressive voder network to restore the audio in a sample-by-sample manner.
arXiv Detail & Related papers (2019-04-10T10:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.