Enhancing into the codec: Noise Robust Speech Coding with
Vector-Quantized Autoencoders
- URL: http://arxiv.org/abs/2102.06610v1
- Date: Fri, 12 Feb 2021 16:42:19 GMT
- Title: Enhancing into the codec: Noise Robust Speech Coding with
Vector-Quantized Autoencoders
- Authors: Jonah Casebeer, Vinjai Vale, Umut Isik, Jean-Marc Valin, Ritwik Giri,
Arvindh Krishnaswamy
- Abstract summary: We develop compressor-enhancer encoders and accompanying decoders based on VQ-VAE autoencoders with WaveRNN decoders.
We observe that a compressor-enhancer model performs better on clean speech inputs than a compressor model trained only on clean speech.
- Score: 21.74276379834421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio codecs based on discretized neural autoencoders have recently been
developed and shown to provide significantly higher compression levels for
comparable quality speech output. However, these models are tightly coupled
with speech content, and produce unintended outputs in noisy conditions. Based
on VQ-VAE autoencoders with WaveRNN decoders, we develop compressor-enhancer
encoders and accompanying decoders, and show that they operate well in noisy
conditions. We also observe that a compressor-enhancer model performs better on
clean speech inputs than a compressor model trained only on clean speech.
Related papers
- Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference [10.909997817643905]
We present the Low Frame-rate Speech Codec (LFSC): a neural audio that leverages a finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps and 21.5 frames per second.
We demonstrate that our novel LLM can make the inference of text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
arXiv Detail & Related papers (2024-09-18T16:39:10Z) - CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z) - FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit
for Neural Speech Codec [55.95078490630001]
This paper presents FunCodec, a fundamental neural speech toolkit, which is an extension of the open-source speech processing toolkit FunASR.
FunCodec provides reproducible training recipes and inference scripts for the latest neural speech models, such as SoundStream and Encodec.
Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes.
arXiv Detail & Related papers (2023-09-14T03:18:24Z) - High-Fidelity Audio Compression with Improved RVQGAN [49.7859037103693]
We introduce a high-fidelity universal neural audio compression algorithm that achieves 90x compression of 44.1 KHz audio into tokens at just 8kbps bandwidth.
We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio.
arXiv Detail & Related papers (2023-06-11T00:13:00Z) - Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z) - Ultra-Low-Bitrate Speech Coding with Pretrained Transformers [28.400364949575103]
Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion.
We use pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias.
arXiv Detail & Related papers (2022-07-05T18:52:11Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate [8.312162364318235]
We present a GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s.
The proposed model is a modified version of the StyleMelGAN vocoder that can run in frame-by-frame manner.
arXiv Detail & Related papers (2021-08-09T14:03:07Z) - SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.