Low Bit-Rate Wideband Speech Coding: A Deep Generative Model based
Approach
- URL: http://arxiv.org/abs/2102.02640v1
- Date: Thu, 4 Feb 2021 14:37:16 GMT
- Title: Low Bit-Rate Wideband Speech Coding: A Deep Generative Model based
Approach
- Authors: Gang Min, Xiongwei Zhang, Xia Zou, Xiangyang Liu
- Abstract summary: Traditional low bit-rate speech coding approaches only handle narrowband speech sampled at 8 kHz.
This paper presents a new approach based on vector quantization (VQ) of mel-frequency cepstral coefficients (MFCCs) and a deep generative model called WaveGlow.
It provides better speech quality than the state-of-the-art classic MELPe codec at a lower bit-rate.
- Score: 4.02517560480215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional low bit-rate speech coding approaches only handle
narrowband speech sampled at 8 kHz, which limits further improvement in speech
quality. Motivated by recent successful explorations of deep learning methods
for image and speech compression, this paper presents a new approach that
combines vector quantization (VQ) of mel-frequency cepstral coefficients
(MFCCs) with a deep generative model called WaveGlow to provide efficient and
high-quality speech coding. The coding feature is solely an 80-dimensional
MFCC vector for 16 kHz wideband speech, so speech coding at bit-rates from
1000 to 2000 bit/s can be implemented scalably by applying different VQ
schemes to the MFCC vector. This deep-generative-network-based codec runs fast
because the WaveGlow model abandons the sample-by-sample autoregressive
mechanism. We evaluate the approach on the multi-speaker TIMIT corpus, and
experimental results demonstrate that it provides better speech quality than
the state-of-the-art classic MELPe codec at a lower bit-rate.
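To make the coding scheme above concrete, the sketch below walks through the feature-level pipeline it describes: extract an 80-dimensional MFCC vector per frame of 16 kHz speech, quantize it with a split vector quantizer, and count the bits that would be transmitted; the quantized features would then drive a WaveGlow-style non-autoregressive vocoder. This is a minimal illustration under assumed settings: the frame hop, split-VQ layout, codebook sizes, and k-means codebook training are illustrative choices, not taken from the paper.

```python
# Minimal sketch of MFCC-domain VQ coding for 16 kHz wideband speech.
# Assumptions (not from the paper): hop of 256 samples (62.5 frames/s),
# split VQ with 4 sub-vectors of 20 dims and 8-bit codebooks, i.e.
# 32 bits/frame -> 2000 bit/s; codebooks are trained with k-means here.
import numpy as np
import librosa
from sklearn.cluster import KMeans

SR, N_MFCC, HOP = 16000, 80, 256
N_SPLITS, BITS_PER_SPLIT = 4, 8          # 4 x 8 = 32 bits per frame

def extract_features(wav):
    # (n_frames, 80) MFCC matrix, one vector per analysis frame
    return librosa.feature.mfcc(y=wav, sr=SR, n_mfcc=N_MFCC, hop_length=HOP).T

def train_split_vq(features):
    # One k-means codebook (2**8 = 256 entries) per 20-dim sub-vector
    subs = np.split(features, N_SPLITS, axis=1)
    return [KMeans(n_clusters=2**BITS_PER_SPLIT, n_init=4).fit(s) for s in subs]

def encode(features, codebooks):
    # Transmit only the codebook indices: shape (n_frames, N_SPLITS)
    subs = np.split(features, N_SPLITS, axis=1)
    return np.stack([cb.predict(s) for cb, s in zip(codebooks, subs)], axis=1)

def decode_features(indices, codebooks):
    # Reconstruct the 80-dim MFCC vectors from the transmitted indices
    parts = [cb.cluster_centers_[indices[:, i]] for i, cb in enumerate(codebooks)]
    return np.concatenate(parts, axis=1)

if __name__ == "__main__":
    wav = librosa.chirp(fmin=100, fmax=7000, sr=SR, duration=10.0)  # stand-in signal
    feats = extract_features(wav)
    cbs = train_split_vq(feats)           # in practice trained on a large corpus
    idx = encode(feats, cbs)              # this is all that would be transmitted
    feats_hat = decode_features(idx, cbs)
    # A WaveGlow-style non-autoregressive vocoder would synthesize the waveform
    # from feats_hat in parallel (omitted here).
    bitrate = (SR / HOP) * N_SPLITS * BITS_PER_SPLIT
    print(f"frames/s = {SR / HOP:.1f}, bit-rate = {bitrate:.0f} bit/s")
```

With the assumed hop of 256 samples at 16 kHz (62.5 frames/s) and 32 bits per frame, the sketch lands at 2000 bit/s, the upper end of the range quoted in the abstract; coarser VQ schemes over the same MFCC vector would cover the lower end.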
Related papers
- LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec [14.7377193484733]
We propose LSCodec, a discrete speech codec that has both a low bitrate and speaker decoupling ability.
By reconstruction experiments, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and smaller vocabulary size than baselines.
arXiv Detail & Related papers (2024-10-21T08:23:31Z) - VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression [29.368893236587343]
Recent neural audio compression models have progressively adopted residual vector quantization (RVQ).
These models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoffs.
We propose variable RVQ (VRVQ) for audio codecs, which allows for more efficient coding by adapting the number of codebooks used per frame.
arXiv Detail & Related papers (2024-10-08T13:18:24Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Graph Neural Networks for Channel Decoding [71.15576353630667]
We showcase competitive decoding performance for various coding schemes, such as low-density parity-check (LDPC) and BCH codes.
The idea is to let a neural network (NN) learn a generalized message passing algorithm over a given graph.
We benchmark our proposed decoder against state-of-the-art in conventional channel decoding as well as against recent deep learning-based results.
arXiv Detail & Related papers (2022-07-29T15:29:18Z) - Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is the task of increasing the speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z) - A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate [8.312162364318235]
We present a GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s.
The proposed model is a modified version of the StyleMelGAN vocoder that can run in a frame-by-frame manner.
arXiv Detail & Related papers (2021-08-09T14:03:07Z) - SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end (see the RVQ sketch after this list).
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z) - Scalable and Efficient Neural Speech Coding [24.959825692325445]
This work presents a scalable and efficient neural waveform codec (NWC) for speech compression.
The proposed CNN autoencoder also defines quantization and coding as a trainable module.
Compared to other autoregressive decoder-based neural speech coders, our decoder has a significantly smaller architecture.
arXiv Detail & Related papers (2021-03-27T00:10:16Z)