Cross-Scale Vector Quantization for Scalable Neural Speech Coding
- URL: http://arxiv.org/abs/2207.03067v1
- Date: Thu, 7 Jul 2022 03:23:25 GMT
- Title: Cross-Scale Vector Quantization for Scalable Neural Speech Coding
- Authors: Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu
- Abstract summary: Bitrate scalability is a desirable feature for audio coding in real-time communications.
In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ).
In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and quality improves progressively as more bits become available.
- Score: 22.65761249591267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bitrate scalability is a desirable feature for audio coding in real-time
communications. Existing neural audio codecs usually enforce a specific bitrate
during training, so a separate model must be trained for each target bitrate;
this increases the memory footprint on both the sender and receiver sides, and
transcoding is often needed to support multiple receivers. In this
paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ),
in which multi-scale features are encoded progressively with stepwise feature
fusion and refinement. In this way, a coarse-level signal is reconstructed if
only a portion of the bitstream is received, and the reconstruction quality
improves progressively as more bits become available. The proposed CSVQ scheme can be flexibly
applied to any neural audio coding network with a mirrored auto-encoder
structure to achieve bitrate scalability. Subjective results show that the
proposed scheme outperforms the classical residual VQ (RVQ) while providing
scalability. Moreover, the proposed CSVQ at 3 kbps outperforms Opus at 9 kbps
and Lyra at 3 kbps, and it provides a graceful quality boost as the bitrate
increases. A minimal sketch of the progressive-decoding idea follows.
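For intuition, here is a toy sketch of the coarse-to-fine decoding behavior described above. It collapses CSVQ's multi-scale feature fusion into a plain residual codebook stack; the dimensions, codebook sizes, and nearest-neighbor quantizer are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CB_SIZE, N_SCALES = 8, 16, 3   # toy sizes, not the paper's

# One codebook per scale; in CSVQ these are learned jointly with the codec.
codebooks = [rng.normal(size=(CB_SIZE, DIM)) for _ in range(N_SCALES)]

def encode(feature):
    """Quantize coarse-to-fine; each scale encodes the remaining residual."""
    indices, residual = [], feature.copy()
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def decode(indices):
    """Reconstruct from however many index layers arrived: bitrate scalability."""
    return sum(codebooks[s][i] for s, i in enumerate(indices))

feature = rng.normal(size=DIM)
indices = encode(feature)
for n in range(1, N_SCALES + 1):    # simulate receiving a growing bitstream
    err = float(np.linalg.norm(feature - decode(indices[:n])))
    print(f"{n} layer(s) received -> reconstruction error {err:.3f}")
```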
Related papers
- VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression [29.368893236587343]
Recent neural audio compression models have progressively adopted residual vector quantization (RVQ).
These models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoffs.
We propose variable RVQ (VRVQ) for audio codecs, which allows for more efficient coding by adapting the number of codebooks used per frame.
arXiv Detail & Related papers (2024-10-08T13:18:24Z)
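A toy version of VRVQ's per-frame allocation: a residual-VQ encoder that spends a variable number of codebooks per frame. The energy-based allocation rule below is a made-up stand-in for the importance map the paper learns.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, CB_SIZE, MAX_BOOKS = 8, 16, 4   # illustrative sizes only
codebooks = [rng.normal(size=(CB_SIZE, DIM)) for _ in range(MAX_BOOKS)]

def rvq_encode(frame, n_books):
    """Plain residual VQ, truncated to n_books codebooks for this frame."""
    idxs, residual = [], frame.copy()
    for cb in codebooks[:n_books]:
        i = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        idxs.append(i)
        residual = residual - cb[i]
    return idxs

def books_for(frame):
    """Hypothetical importance rule: louder frames get more codebooks.
    VRVQ learns this allocation end-to-end instead."""
    energy = float(np.mean(frame ** 2))
    return int(np.clip(np.ceil(2 * energy), 1, MAX_BOOKS))

frames = rng.normal(size=(5, DIM)) * rng.uniform(0.2, 2.0, size=(5, 1))
for t, frame in enumerate(frames):
    n = books_for(frame)
    bits = n * int(np.log2(CB_SIZE))
    print(f"frame {t}: {len(rvq_encode(frame, n))} codebook(s) -> {bits} bits")
```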
- High-Efficiency Neural Video Compression via Hierarchical Predictive Learning [27.41398149573729]
Enhanced Deep Hierarchical Video Compression (DHVC 2.0) delivers superior compression performance and impressive complexity efficiency.
Uses hierarchical predictive coding to transform each video frame into multiscale representations.
Supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss.
arXiv Detail & Related papers (2024-10-03T15:40:58Z)
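The hierarchical-prediction idea reads roughly as below: code the coarsest scale, then only per-scale prediction residuals, so any prefix of the layers decodes to a coarser reconstruction. The averaging/nearest-neighbor resamplers are simple stand-ins for the paper's learned transforms.

```python
import numpy as np

def down(x):  # 2x average pooling, a stand-in for a learned analysis transform
    return x.reshape(-1, 2).mean(axis=1)

def up(x):    # nearest-neighbor upsampling, a stand-in for a learned synthesis
    return np.repeat(x, 2)

def encode(frame, levels=3):
    """Send the coarsest scale, then per-level prediction residuals."""
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(down(pyramid[-1]))
    layers, recon = [pyramid[-1]], pyramid[-1]   # base layer
    for fine in reversed(pyramid[:-1]):
        pred = up(recon)                 # predict finer scale from coarser
        layers.append(fine - pred)       # only the residual is coded
        recon = pred + layers[-1]
    return layers

def decode(layers):
    """Progressive: stop after any prefix for a coarser reconstruction."""
    recon = layers[0]
    for residual in layers[1:]:
        recon = up(recon) + residual
    return recon

frame = np.sin(np.linspace(0, 3, 16))    # toy 1-D "frame"
layers = encode(frame)
print("full reconstruction error:",
      float(np.abs(decode(layers) - frame).max()))
```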
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in the offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
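Hybrid CTC/attention training typically combines the two losses with an interpolation weight; a minimal sketch follows, where the dimensions and the 0.3 weight are common choices, not values from this paper.

```python
import torch
import torch.nn as nn

# Toy dimensions; a real AV-ASR encoder fuses audio and visual streams.
T, B, VOCAB, U = 50, 2, 30, 10   # frames, batch, vocab (0 = blank), label length
ctc_weight = 0.3                 # common interpolation value, an assumption here

encoder_out = torch.randn(T, B, VOCAB).log_softmax(-1)   # (T, B, V) log-probs
targets = torch.randint(1, VOCAB, (B, U))                # labels exclude blank
in_lens = torch.full((B,), T, dtype=torch.long)
tgt_lens = torch.full((B,), U, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)(encoder_out, targets, in_lens, tgt_lens)

# Stand-in for the attention decoder's per-token cross-entropy.
decoder_logits = torch.randn(B, U, VOCAB)
att_loss = nn.CrossEntropyLoss()(decoder_logits.reshape(-1, VOCAB),
                                 targets.reshape(-1))

loss = ctc_weight * ctc_loss + (1 - ctc_weight) * att_loss
print(f"hybrid loss: {loss.item():.3f}")
```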
- Graph Neural Networks for Channel Decoding [71.15576353630667]
We showcase competitive decoding performance for various coding schemes, such as low-density parity-check (LDPC) and BCH codes.
The idea is to let a neural network (NN) learn a generalized message passing algorithm over a given graph.
We benchmark our proposed decoder against state-of-the-art in conventional channel decoding as well as against recent deep learning-based results.
arXiv Detail & Related papers (2022-07-29T15:29:18Z)
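The message passing such a GNN generalizes resembles classical belief propagation on the code's Tanner graph; below is a bare-bones min-sum iteration for a tiny parity-check matrix (the (7,4) Hamming code is just a convenient example).

```python
import numpy as np

# Parity-check matrix of the (7,4) Hamming code, a small illustrative example.
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def min_sum_decode(llr, iters=10):
    """Classical min-sum belief propagation; a learned GNN decoder replaces
    these hand-designed update rules with trainable ones."""
    m_vc = H * llr                      # variable-to-check messages on edges
    for _ in range(iters):
        m_cv = np.zeros_like(m_vc, dtype=float)
        for c in range(H.shape[0]):
            v_idx = np.flatnonzero(H[c])
            for v in v_idx:
                others = m_vc[c, v_idx[v_idx != v]]   # extrinsic messages only
                m_cv[c, v] = np.prod(np.sign(others)) * np.min(np.abs(others))
        total = llr + m_cv.sum(axis=0)  # posterior LLR per bit
        m_vc = H * (total - m_cv)       # extrinsic variable-to-check update
    return (total < 0).astype(int)

llr = np.array([2.0, -1.5, 3.0, 0.5, 1.0, -2.0, 1.5])  # toy channel LLRs
print("decoded bits:", min_sum_decode(llr))
```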
- CADyQ: Content-Aware Dynamic Quantization for Image Super-Resolution [55.50793823060282]
We propose a novel Content-Aware Dynamic Quantization (CADyQ) method for image super-resolution (SR) networks.
CADyQ allocates optimal bits to local regions and layers adaptively based on the local contents of an input image.
The pipeline has been tested on various SR networks and evaluated on several standard benchmarks.
arXiv Detail & Related papers (2022-07-21T07:50:50Z)
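A crude stand-in for the content-aware allocation: CADyQ learns the bit selection, while the sketch below just maps a patch's local detail to a bit-width and quantizes accordingly (the thresholds and patch size are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)

def bit_width(patch):
    """Hypothetical content measure: busier patches get more bits.
    CADyQ trains a small selector instead of this fixed rule."""
    detail = (np.abs(np.diff(patch, axis=0)).mean()
              + np.abs(np.diff(patch, axis=1)).mean())
    return 8 if detail > 0.5 else 6 if detail > 0.2 else 4

def quantize(x, bits):
    """Uniform quantization of values in [0, 1] to 2**bits levels."""
    levels = 2 ** bits
    return np.round(x * (levels - 1)) / (levels - 1)

image = rng.random((16, 16))
for i in (0, 8):
    for j in (0, 8):
        patch = image[i:i + 8, j:j + 8]
        b = bit_width(patch)
        err = np.abs(quantize(patch, b) - patch).max()
        print(f"patch ({i},{j}): {b}-bit, max quantization error {err:.4f}")
```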
- Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z)
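Latent-domain prediction amounts to quantizing only what the past latents could not predict; a toy linear predictor below stands in for the learned one (the AR(1)-style latents and the scalar quantizer are assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)

def quantize(x, step=0.5):
    """Coarse uniform quantizer standing in for a learned VQ stage."""
    return np.round(x / step) * step

# Toy correlated latent sequence (AR(1)); real latents come from the encoder.
latents = np.zeros(50)
for t in range(1, 50):
    latents[t] = 0.9 * latents[t - 1] + 0.1 * rng.normal()

pred_coef = 0.9                     # TF-Codec learns its predictor end-to-end
recon = np.zeros_like(latents)
for t in range(len(latents)):
    prediction = pred_coef * recon[t - 1] if t > 0 else 0.0
    recon[t] = prediction + quantize(latents[t] - prediction)  # code residual only

residuals = latents[1:] - pred_coef * latents[:-1]
print(f"latent std {latents.std():.3f} vs residual std {residuals.std():.3f}")
```

The residual's variance is well below the latent's, which is where the bitrate saving comes from.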
- Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) increases the speech sampling rate by generating the missing high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
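Training the encoder, decoder, and quantizer jointly end-to-end typically relies on a straight-through gradient through the non-differentiable codebook lookup. Here is a minimal VQ layer in that style (standard VQ-VAE machinery, not SoundStream's exact residual quantizer):

```python
import torch

class VectorQuantizer(torch.nn.Module):
    """Nearest-neighbor VQ with a straight-through gradient estimator."""
    def __init__(self, n_codes=64, dim=16):
        super().__init__()
        self.codebook = torch.nn.Parameter(torch.randn(n_codes, dim))

    def forward(self, z):
        d = torch.cdist(z, self.codebook)           # (batch, n_codes) distances
        codes = self.codebook[d.argmin(dim=1)]      # hard nearest-code lookup
        z_q = z + (codes - z).detach()              # straight-through forward
        commit = torch.mean((z - codes.detach()) ** 2)   # trains the encoder
        cb_loss = torch.mean((codes - z.detach()) ** 2)  # trains the codebook
        return z_q, commit + cb_loss

vq = VectorQuantizer()
z = torch.randn(4, 16, requires_grad=True)
z_q, vq_loss = vq(z)
(z_q.sum() + 0.25 * vq_loss).backward()   # gradients reach encoder and codebook
print("encoder grad exists:", z.grad is not None)
```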
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
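The decomposition idea can be seen with plain integers: any odd value in [-(2^k - 1), 2^k - 1] splits into k binary {-1, +1} components with power-of-two weights, so a k-bit branch becomes k binary branches. The snippet below is a toy verification of that identity, not the paper's full scheme.

```python
import numpy as np

def decompose(v, k):
    """Write an odd integer v in [-(2**k - 1), 2**k - 1] as
    sum(2**i * s[i]) with every s[i] in {-1, +1}."""
    signs = np.empty(k, dtype=int)
    for i in reversed(range(k)):
        signs[i] = 1 if v > 0 else -1
        v -= signs[i] * 2 ** i
    return signs

K = 2
rng = np.random.default_rng(4)
W = rng.choice([-3, -1, 1, 3], size=(4, 4))   # 2-bit odd quantized weights
branches = np.array([[decompose(v, K) for v in row] for row in W])  # (4, 4, K)

# y = x @ W equals a weighted sum of binary {-1, +1} matmuls.
x = rng.normal(size=(1, 4))
y_binary = sum(2 ** i * (x @ branches[:, :, i]) for i in range(K))
print("binary branches match quantized matmul:", np.allclose(y_binary, x @ W))
```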
- Scalable and Efficient Neural Speech Coding [24.959825692325445]
This work presents a scalable and efficient neural waveform codec (NWC) for speech compression.
The proposed CNN autoencoder also defines quantization and coding as a trainable module.
Compared to other autoregressive decoder-based neural speech codecs, our decoder has a significantly smaller architecture.
arXiv Detail & Related papers (2021-03-27T00:10:16Z)
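Making quantization trainable is often done with a soft-to-hard scheme: a softmax over distances to learnable centers during training, annealed toward the hard assignment used at inference. The sketch below assumes that generic mechanism rather than quoting the paper's exact module.

```python
import numpy as np

def soft_quantize(x, centers, alpha):
    """Soft assignment: a differentiable weighted sum of quantization centers.
    As alpha grows, this approaches hard nearest-center quantization."""
    d = -alpha * (x[:, None] - centers[None, :]) ** 2
    w = np.exp(d - d.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # softmax over centers
    return w @ centers

centers = np.linspace(-1, 1, 8)                  # learnable in the real codec
x = np.random.default_rng(5).uniform(-1, 1, 5)
for alpha in (1.0, 10.0, 1000.0):                # annealing schedule
    print(f"alpha={alpha:7.1f} ->", np.round(soft_quantize(x, centers, alpha), 3))
print("hard           ->",
      np.round(centers[np.abs(x[:, None] - centers).argmin(1)], 3))
```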
- Enhancement Of Coded Speech Using a Mask-Based Post-Filter [9.324642081509754]
A data-driven post-filter relying on masking in the time-frequency domain is proposed.
A fully connected neural network (FCNN), a convolutional encoder-decoder (CED) network, and a long short-term memory (LSTM) network are implemented to estimate a real-valued mask per time-frequency bin.
arXiv Detail & Related papers (2020-10-12T09:48:09Z)
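Applying such a post-filter reduces to a per-bin multiplication in the time-frequency domain; the sketch below uses a made-up magnitude-based mask where the networks above would predict one per bin.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
# Stand-in for decoded speech: a tone plus coding-noise-like disturbance.
decoded = (np.sin(2 * np.pi * 440 * t)
           + 0.1 * np.random.default_rng(6).normal(size=fs))

f, frames, spec = stft(decoded, fs=fs, nperseg=512)   # time-frequency analysis
mask = np.clip(np.abs(spec) / (np.abs(spec).max() + 1e-9), 0.1, 1.0)
# ^ placeholder rule; the FCNN/CED/LSTM above would predict this mask per bin
_, enhanced = istft(spec * mask, fs=fs, nperseg=512)  # back to the waveform
print("output length:", len(enhanced))
```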
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.