Neural Vocoder is All You Need for Speech Super-resolution
- URL: http://arxiv.org/abs/2203.14941v1
- Date: Mon, 28 Mar 2022 17:51:00 GMT
- Title: Neural Vocoder is All You Need for Speech Super-resolution
- Authors: Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang
- Abstract summary: Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
- Score: 56.84715616516612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech super-resolution (SR) is a task to increase speech sampling rate by
generating high-frequency components. Existing speech SR methods are trained in
constrained experimental settings, such as a fixed upsampling ratio. These
strong constraints can potentially lead to poor generalization ability in
mismatched real-world cases. In this paper, we propose a neural vocoder based
speech super-resolution method (NVSR) that can handle a variety of input
resolutions and upsampling ratios. NVSR consists of a mel-bandwidth extension
module, a neural vocoder module, and a post-processing module. Our proposed
system achieves state-of-the-art results on the VCTK multi-speaker benchmark.
On 44.1 kHz target resolution, NVSR outperforms WSRGlow and Nu-wave by 8% and
37% respectively on log spectral distance and achieves a significantly better
perceptual quality. We also demonstrate that prior knowledge in the pre-trained
vocoder is crucial for speech SR by performing mel-bandwidth extension with a
simple replication-padding method. Samples can be found in
https://haoheliu.github.io/nvsr.
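The abstract compares systems by log spectral distance (LSD), where lower is better. As a rough illustration only, the sketch below computes one commonly used form of LSD: the per-frame root-mean-square difference between log power spectra, averaged over frames. The function name `log_spectral_distance`, the list-of-frames input format, and the `eps` stabiliser are assumptions made for this example; the STFT settings and exact LSD variant used in the paper's evaluation are not stated on this page.

```python
import math


def log_spectral_distance(ref_mag, est_mag, eps=1e-8):
    """Return the LSD between two magnitude spectrograms.

    Both inputs are lists of frames, each frame a list of non-negative
    magnitudes per frequency bin (hypothetical format for this sketch).
    """
    if len(ref_mag) != len(est_mag) or not ref_mag:
        raise ValueError("spectrograms must be non-empty with equal frame counts")

    per_frame = []
    for ref_frame, est_frame in zip(ref_mag, est_mag):
        if len(ref_frame) != len(est_frame) or not ref_frame:
            raise ValueError("frames must be non-empty with equal bin counts")
        # Squared difference of log10 power spectra for each frequency bin.
        sq_diffs = [
            (math.log10(r * r + eps) - math.log10(e * e + eps)) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        # Root of the mean over frequency bins gives the distance for this frame.
        per_frame.append(math.sqrt(sum(sq_diffs) / len(sq_diffs)))

    # Average the per-frame distances over time.
    return sum(per_frame) / len(per_frame)


# Toy usage: two frames with three bins each.
reference = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
louder = [[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
lsd_same = log_spectral_distance(reference, reference)
lsd_louder = log_spectral_distance(reference, louder)
```

In this toy case the second spectrogram has ten times the magnitude in every bin, so each log power value differs by about 2 and the LSD comes out close to 2. Identical spectrograms give an LSD of 0.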
Related papers
- Decoder-only Architecture for Streaming End-to-end Speech Recognition [45.161909551392085]
We propose to use a decoder-only architecture for blockwise streaming automatic speech recognition (ASR).
In our approach, speech features are compressed using CTC output and context embedding using a blockwise speech subnetwork, and are sequentially provided as prompts to the decoder.
Our proposed decoder-only streaming ASR achieves 8% relative word error rate reduction in the LibriSpeech test-other set while being twice as fast as the baseline model.
arXiv Detail & Related papers (2024-06-23T13:50:08Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- NERV++: An Enhanced Implicit Neural Video Representation [11.25130799452367]
We introduce NeRV++, an enhanced implicit neural representation for videos.
NeRV++ is a more straightforward yet effective enhancement over the original NeRV decoder architecture.
We evaluate our method on UVG, MCL JVC, and Bunny datasets, achieving competitive results for video compression with INRs.
arXiv Detail & Related papers (2024-02-28T13:00:32Z)
- mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra [4.721572768262729]
Speech super-resolution (SSR) aims to recover high-resolution (HR) speech from its corresponding low-resolution (LR) counterpart.
Recent SSR methods focus more on the reconstruction of the magnitude spectrogram, ignoring the importance of phase reconstruction.
We propose mdctGAN, a novel SSR framework based on the modified discrete cosine transform (MDCT).
arXiv Detail & Related papers (2023-05-18T16:49:46Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art, real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Towards Lightweight Controllable Audio Synthesis with Conditional Implicit Neural Representations [10.484851004093919]
Implicit neural representations (INRs) are neural networks used to approximate low-dimensional functions.
In this work we shed light on the potential of Conditional Implicit Neural Representations (CINRs) as lightweight backbones in generative frameworks for audio synthesis.
arXiv Detail & Related papers (2021-11-14T13:36:18Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encoding from the input speech that emulates those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.