Cross-Scale Vector Quantization for Scalable Neural Speech Coding
- URL: http://arxiv.org/abs/2207.03067v1
- Date: Thu, 7 Jul 2022 03:23:25 GMT
- Title: Cross-Scale Vector Quantization for Scalable Neural Speech Coding
- Authors: Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu
- Abstract summary: Bitrate scalability is a desirable feature for audio coding in real-time communications.
In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ).
In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and quality improves progressively as more bits become available.
- Score: 22.65761249591267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bitrate scalability is a desirable feature for audio coding in real-time
communications. Existing neural audio codecs usually enforce a specific bitrate
during training, so a separate model must be trained for each target bitrate;
this increases the memory footprint on both the sender and receiver sides, and
transcoding is often needed to support multiple receivers. In this
paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ),
in which multi-scale features are encoded progressively with stepwise feature
fusion and refinement. In this way, a coarse-level signal is reconstructed if
only a portion of the bitstream is received, and the reconstruction quality
improves progressively as more bits become available. The proposed CSVQ scheme can be flexibly
applied to any neural audio coding network with a mirrored auto-encoder
structure to achieve bitrate scalability. Subjective results show that the
proposed scheme outperforms the classical residual VQ (RVQ) while providing
scalability. Moreover, the proposed CSVQ at 3 kbps outperforms Opus at 9 kbps
and Lyra at 3 kbps, and it provides a graceful quality boost as the bitrate
increases. A minimal sketch of the progressive-decoding idea follows.
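For intuition, here is a toy sketch of the coarse-to-fine decoding behavior described above. It collapses CSVQ's multi-scale feature fusion into a plain residual codebook stack; the dimensions, codebook sizes, and nearest-neighbor quantizer are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CB_SIZE, N_SCALES = 8, 16, 3   # toy sizes, not the paper's

# One codebook per scale; in CSVQ these are learned jointly with the codec.
codebooks = [rng.normal(size=(CB_SIZE, DIM)) for _ in range(N_SCALES)]

def encode(feature):
    """Quantize coarse-to-fine; each scale encodes the remaining residual."""
    indices, residual = [], feature.copy()
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def decode(indices):
    """Reconstruct from however many index layers arrived: bitrate scalability."""
    return sum(codebooks[s][i] for s, i in enumerate(indices))

feature = rng.normal(size=DIM)
indices = encode(feature)
for n in range(1, N_SCALES + 1):    # simulate receiving a growing bitstream
    err = float(np.linalg.norm(feature - decode(indices[:n])))
    print(f"{n} layer(s) received -> reconstruction error {err:.3f}")
```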
Related papers
- VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression [29.368893236587343]
Recent neural audio compression models have progressively adopted residual vector quantization (RVQ).
These models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoffs.
We propose variable RVQ (VRVQ) for audio codecs, which allows for more efficient coding by adapting the number of codebooks used per frame.
arXiv Detail & Related papers (2024-10-08T13:18:24Z)
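A toy version of VRVQ's per-frame allocation: a residual-VQ encoder that spends a variable number of codebooks per frame. The energy-based allocation rule below is a made-up stand-in for the importance map the paper learns.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, CB_SIZE, MAX_BOOKS = 8, 16, 4   # illustrative sizes only
codebooks = [rng.normal(size=(CB_SIZE, DIM)) for _ in range(MAX_BOOKS)]

def rvq_encode(frame, n_books):
    """Plain residual VQ, truncated to n_books codebooks for this frame."""
    idxs, residual = [], frame.copy()
    for cb in codebooks[:n_books]:
        i = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        idxs.append(i)
        residual = residual - cb[i]
    return idxs

def books_for(frame):
    """Hypothetical importance rule: louder frames get more codebooks.
    VRVQ learns this allocation end-to-end instead."""
    energy = float(np.mean(frame ** 2))
    return int(np.clip(np.ceil(2 * energy), 1, MAX_BOOKS))

frames = rng.normal(size=(5, DIM)) * rng.uniform(0.2, 2.0, size=(5, 1))
for t, frame in enumerate(frames):
    n = books_for(frame)
    bits = n * int(np.log2(CB_SIZE))
    print(f"frame {t}: {len(rvq_encode(frame, n))} codebook(s) -> {bits} bits")
```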
- High-Efficiency Neural Video Compression via Hierarchical Predictive Learning [27.41398149573729]
Enhanced Deep Hierarchical Video Compression (DHVC 2.0) delivers superior compression performance and impressive complexity efficiency.
Uses hierarchical predictive coding to transform each video frame into multiscale representations.
Supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss.
arXiv Detail & Related papers (2024-10-03T15:40:58Z)
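The hierarchical-prediction idea reads roughly as below: code the coarsest scale, then only per-scale prediction residuals, so any prefix of the layers decodes to a coarser reconstruction. The averaging/nearest-neighbor resamplers are simple stand-ins for the paper's learned transforms.

```python
import numpy as np

def down(x):  # 2x average pooling, a stand-in for a learned analysis transform
    return x.reshape(-1, 2).mean(axis=1)

def up(x):    # nearest-neighbor upsampling, a stand-in for a learned synthesis
    return np.repeat(x, 2)

def encode(frame, levels=3):
    """Send the coarsest scale, then per-level prediction residuals."""
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(down(pyramid[-1]))
    layers, recon = [pyramid[-1]], pyramid[-1]   # base layer
    for fine in reversed(pyramid[:-1]):
        pred = up(recon)                 # predict finer scale from coarser
        layers.append(fine - pred)       # only the residual is coded
        recon = pred + layers[-1]
    return layers

def decode(layers):
    """Progressive: stop after any prefix for a coarser reconstruction."""
    recon = layers[0]
    for residual in layers[1:]:
        recon = up(recon) + residual
    return recon

frame = np.sin(np.linspace(0, 3, 16))    # toy 1-D "frame"
layers = encode(frame)
print("full reconstruction error:",
      float(np.abs(decode(layers) - frame).max()))
```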
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in the offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
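Hybrid CTC/attention training typically combines the two losses with an interpolation weight; a minimal sketch follows, where the dimensions and the 0.3 weight are common choices, not values from this paper.

```python
import torch
import torch.nn as nn

# Toy dimensions; a real AV-ASR encoder fuses audio and visual streams.
T, B, VOCAB, U = 50, 2, 30, 10   # frames, batch, vocab (0 = blank), label length
ctc_weight = 0.3                 # common interpolation value, an assumption here

encoder_out = torch.randn(T, B, VOCAB).log_softmax(-1)   # (T, B, V) log-probs
targets = torch.randint(1, VOCAB, (B, U))                # labels exclude blank
in_lens = torch.full((B,), T, dtype=torch.long)
tgt_lens = torch.full((B,), U, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)(encoder_out, targets, in_lens, tgt_lens)

# Stand-in for the attention decoder's per-token cross-entropy.
decoder_logits = torch.randn(B, U, VOCAB)
att_loss = nn.CrossEntropyLoss()(decoder_logits.reshape(-1, VOCAB),
                                 targets.reshape(-1))

loss = ctc_weight * ctc_loss + (1 - ctc_weight) * att_loss
print(f"hybrid loss: {loss.item():.3f}")
```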
- Graph Neural Networks for Channel Decoding [71.15576353630667]
We showcase competitive decoding performance for various coding schemes, such as low-density parity-check (LDPC) and BCH codes.
The idea is to let a neural network (NN) learn a generalized message passing algorithm over a given graph.
We benchmark our proposed decoder against state-of-the-art in conventional channel decoding as well as against recent deep learning-based results.
arXiv Detail & Related papers (2022-07-29T15:29:18Z)
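The message passing such a GNN generalizes resembles classical belief propagation on the code's Tanner graph; below is a bare-bones min-sum iteration for a tiny parity-check matrix (the (7,4) Hamming code is just a convenient example).

```python
import numpy as np

# Parity-check matrix of the (7,4) Hamming code, a small illustrative example.
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def min_sum_decode(llr, iters=10):
    """Classical min-sum belief propagation; a learned GNN decoder replaces
    these hand-designed update rules with trainable ones."""
    m_vc = H * llr                      # variable-to-check messages on edges
    for _ in range(iters):
        m_cv = np.zeros_like(m_vc, dtype=float)
        for c in range(H.shape[0]):
            v_idx = np.flatnonzero(H[c])
            for v in v_idx:
                others = m_vc[c, v_idx[v_idx != v]]   # extrinsic messages only
                m_cv[c, v] = np.prod(np.sign(others)) * np.min(np.abs(others))
        total = llr + m_cv.sum(axis=0)  # posterior LLR per bit
        m_vc = H * (total - m_cv)       # extrinsic variable-to-check update
    return (total < 0).astype(int)

llr = np.array([2.0, -1.5, 3.0, 0.5, 1.0, -2.0, 1.5])  # toy channel LLRs
print("decoded bits:", min_sum_decode(llr))
```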
- CADyQ: Content-Aware Dynamic Quantization for Image Super-Resolution [55.50793823060282]
We propose a novel Content-Aware Dynamic Quantization (CADyQ) method for image super-resolution (SR) networks.
CADyQ allocates optimal bits to local regions and layers adaptively based on the local contents of an input image.
The pipeline has been tested on various SR networks and evaluated on several standard benchmarks.
arXiv Detail & Related papers (2022-07-21T07:50:50Z)
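A crude stand-in for the content-aware allocation: CADyQ learns the bit selection, while the sketch below just maps a patch's local detail to a bit-width and quantizes accordingly (the thresholds and patch size are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)

def bit_width(patch):
    """Hypothetical content measure: busier patches get more bits.
    CADyQ trains a small selector instead of this fixed rule."""
    detail = (np.abs(np.diff(patch, axis=0)).mean()
              + np.abs(np.diff(patch, axis=1)).mean())
    return 8 if detail > 0.5 else 6 if detail > 0.2 else 4

def quantize(x, bits):
    """Uniform quantization of values in [0, 1] to 2**bits levels."""
    levels = 2 ** bits
    return np.round(x * (levels - 1)) / (levels - 1)

image = rng.random((16, 16))
for i in (0, 8):
    for j in (0, 8):
        patch = image[i:i + 8, j:j + 8]
        b = bit_width(patch)
        err = np.abs(quantize(patch, b) - patch).max()
        print(f"patch ({i},{j}): {b}-bit, max quantization error {err:.4f}")
```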
- Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z)
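Latent-domain prediction amounts to quantizing only what the past latents could not predict; a toy linear predictor below stands in for the learned one (the AR(1)-style latents and the scalar quantizer are assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)

def quantize(x, step=0.5):
    """Coarse uniform quantizer standing in for a learned VQ stage."""
    return np.round(x / step) * step

# Toy correlated latent sequence (AR(1)); real latents come from the encoder.
latents = np.zeros(50)
for t in range(1, 50):
    latents[t] = 0.9 * latents[t - 1] + 0.1 * rng.normal()

pred_coef = 0.9                     # TF-Codec learns its predictor end-to-end
recon = np.zeros_like(latents)
for t in range(len(latents)):
    prediction = pred_coef * recon[t - 1] if t > 0 else 0.0
    recon[t] = prediction + quantize(latents[t] - prediction)  # code residual only

residuals = latents[1:] - pred_coef * latents[:-1]
print(f"latent std {latents.std():.3f} vs residual std {residuals.std():.3f}")
```

The residual's variance is well below the latent's, which is where the bitrate saving comes from.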
- Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) increases the speech sampling rate by generating the missing high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
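Training the encoder, decoder, and quantizer jointly end-to-end typically relies on a straight-through gradient through the non-differentiable codebook lookup. Here is a minimal VQ layer in that style (standard VQ-VAE machinery, not SoundStream's exact residual quantizer):

```python
import torch

class VectorQuantizer(torch.nn.Module):
    """Nearest-neighbor VQ with a straight-through gradient estimator."""
    def __init__(self, n_codes=64, dim=16):
        super().__init__()
        self.codebook = torch.nn.Parameter(torch.randn(n_codes, dim))

    def forward(self, z):
        d = torch.cdist(z, self.codebook)           # (batch, n_codes) distances
        codes = self.codebook[d.argmin(dim=1)]      # hard nearest-code lookup
        z_q = z + (codes - z).detach()              # straight-through forward
        commit = torch.mean((z - codes.detach()) ** 2)   # trains the encoder
        cb_loss = torch.mean((codes - z.detach()) ** 2)  # trains the codebook
        return z_q, commit + cb_loss

vq = VectorQuantizer()
z = torch.randn(4, 16, requires_grad=True)
z_q, vq_loss = vq(z)
(z_q.sum() + 0.25 * vq_loss).backward()   # gradients reach encoder and codebook
print("encoder grad exists:", z.grad is not None)
```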
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
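The decomposition idea can be seen with plain integers: any odd value in [-(2^k - 1), 2^k - 1] splits into k binary {-1, +1} components with power-of-two weights, so a k-bit branch becomes k binary branches. The snippet below is a toy verification of that identity, not the paper's full scheme.

```python
import numpy as np

def decompose(v, k):
    """Write an odd integer v in [-(2**k - 1), 2**k - 1] as
    sum(2**i * s[i]) with every s[i] in {-1, +1}."""
    signs = np.empty(k, dtype=int)
    for i in reversed(range(k)):
        signs[i] = 1 if v > 0 else -1
        v -= signs[i] * 2 ** i
    return signs

K = 2
rng = np.random.default_rng(4)
W = rng.choice([-3, -1, 1, 3], size=(4, 4))   # 2-bit odd quantized weights
branches = np.array([[decompose(v, K) for v in row] for row in W])  # (4, 4, K)

# y = x @ W equals a weighted sum of binary {-1, +1} matmuls.
x = rng.normal(size=(1, 4))
y_binary = sum(2 ** i * (x @ branches[:, :, i]) for i in range(K))
print("binary branches match quantized matmul:", np.allclose(y_binary, x @ W))
```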
- Scalable and Efficient Neural Speech Coding [24.959825692325445]
This work presents a scalable and efficient neural waveform codec (NWC) for speech compression.
The proposed CNN autoencoder also defines quantization and coding as a trainable module.
Compared to other autoregressive decoder-based neural speech codecs, our decoder has a significantly smaller architecture.
arXiv Detail & Related papers (2021-03-27T00:10:16Z)
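Making quantization trainable is often done with a soft-to-hard scheme: a softmax over distances to learnable centers during training, annealed toward the hard assignment used at inference. The sketch below assumes that generic mechanism rather than quoting the paper's exact module.

```python
import numpy as np

def soft_quantize(x, centers, alpha):
    """Soft assignment: a differentiable weighted sum of quantization centers.
    As alpha grows, this approaches hard nearest-center quantization."""
    d = -alpha * (x[:, None] - centers[None, :]) ** 2
    w = np.exp(d - d.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # softmax over centers
    return w @ centers

centers = np.linspace(-1, 1, 8)                  # learnable in the real codec
x = np.random.default_rng(5).uniform(-1, 1, 5)
for alpha in (1.0, 10.0, 1000.0):                # annealing schedule
    print(f"alpha={alpha:7.1f} ->", np.round(soft_quantize(x, centers, alpha), 3))
print("hard           ->",
      np.round(centers[np.abs(x[:, None] - centers).argmin(1)], 3))
```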
- Enhancement Of Coded Speech Using a Mask-Based Post-Filter [9.324642081509754]
A data-driven post-filter relying on masking in the time-frequency domain is proposed.
A fully connected neural network (FCNN), a convolutional encoder-decoder (CED) network, and a long short-term memory (LSTM) network are implemented to estimate a real-valued mask per time-frequency bin.
arXiv Detail & Related papers (2020-10-12T09:48:09Z)
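Applying such a post-filter reduces to a per-bin multiplication in the time-frequency domain; the sketch below uses a made-up magnitude-based mask where the networks above would predict one per bin.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
# Stand-in for decoded speech: a tone plus coding-noise-like disturbance.
decoded = (np.sin(2 * np.pi * 440 * t)
           + 0.1 * np.random.default_rng(6).normal(size=fs))

f, frames, spec = stft(decoded, fs=fs, nperseg=512)   # time-frequency analysis
mask = np.clip(np.abs(spec) / (np.abs(spec).max() + 1e-9), 0.1, 1.0)
# ^ placeholder rule; the FCNN/CED/LSTM above would predict this mask per bin
_, enhanced = istft(spec * mask, fs=fs, nperseg=512)  # back to the waveform
print("output length:", len(enhanced))
```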
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.