Neural Vocoder is All You Need for Speech Super-resolution
- URL: http://arxiv.org/abs/2203.14941v1
- Date: Mon, 28 Mar 2022 17:51:00 GMT
- Title: Neural Vocoder is All You Need for Speech Super-resolution
- Authors: Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang
Wang
- Abstract summary: Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolution and upsampling ratios.
- Score: 56.84715616516612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech super-resolution (SR) is a task to increase speech sampling rate by
generating high-frequency components. Existing speech SR methods are trained in
constrained experimental settings, such as a fixed upsampling ratio. These
strong constraints can potentially lead to poor generalization ability in
mismatched real-world cases. In this paper, we propose a neural vocoder based
speech super-resolution method (NVSR) that can handle a variety of input
resolution and upsampling ratios. NVSR consists of a mel-bandwidth extension
module, a neural vocoder module, and a post-processing module. Our proposed
system achieves state-of-the-art results on the VCTK multi-speaker benchmark.
On 44.1 kHz target resolution, NVSR outperforms WSRGlow and Nu-wave by 8% and
37% respectively on log spectral distance and achieves a significantly better
perceptual quality. We also demonstrate that prior knowledge in the pre-trained
vocoder is crucial for speech SR by performing mel-bandwidth extension with a
simple replication-padding method. Samples can be found in
https://haoheliu.github.io/nvsr.
Related papers
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - Decoder-only Architecture for Streaming End-to-end Speech Recognition [45.161909551392085]
We propose to use a decoder-only architecture for blockwise streaming automatic speech recognition (ASR)
In our approach, speech features are compressed using CTC output and context embedding using blockwise speech subnetwork, and are sequentially provided as prompts to the decoder.
Our proposed decoder-only streaming ASR achieves 8% relative word error rate reduction in the LibriSpeech test-other set while being twice as fast as the baseline model.
arXiv Detail & Related papers (2024-06-23T13:50:08Z) - VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z) - NERV++: An Enhanced Implicit Neural Video Representation [11.25130799452367]
We introduce neural representations for videos NeRV++, an enhanced implicit neural video representation.
NeRV++ is more straightforward yet effective enhancement over the original NeRV decoder architecture.
We evaluate our method on UVG, MCL JVC, and Bunny datasets, achieving competitive results for video compression with INRs.
arXiv Detail & Related papers (2024-02-28T13:00:32Z) - mdctGAN: Taming transformer-based GAN for speech super-resolution with
Modified DCT spectra [4.721572768262729]
Speech super-resolution (SSR) aims to recover a high resolution (HR) speech from its corresponding low resolution (LR) counterpart.
Recent SSR methods focus more on the reconstruction of the magnitude spectrogram, ignoring the importance of phase reconstruction.
We propose mdctGAN, a novel SSR framework based on modified discrete cosine transform (MDCT)
arXiv Detail & Related papers (2023-05-18T16:49:46Z) - NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN's synthesis efficiency on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z) - Towards Lightweight Controllable Audio Synthesis with Conditional
Implicit Neural Representations [10.484851004093919]
Implicit neural representations (INRs) are neural networks used to approximate low-dimensional functions.
In this work we shed light on the potential of Conditional Implicit Neural Representations (CINRs) as lightweight backbones in generative frameworks for audio synthesis.
arXiv Detail & Related papers (2021-11-14T13:36:18Z) - DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encoding from the input speech that emulates those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.