Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates
- URL: http://arxiv.org/abs/2509.09550v2
- Date: Fri, 12 Sep 2025 06:43:25 GMT
- Title: Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates
- Authors: Harry Julian, Rachel Beeson, Lohith Konathala, Johanna Ulin, Jiameng Gao
- Abstract summary: We show that Finite Scalar Quantization (FSQ) encodes baked-in redundancy that produces an encoding robust to transmission through noisy channels. We demonstrate that FSQ has vastly superior bit-level perturbation robustness by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
- Score: 1.445167946386569
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural Audio Codecs (NACs) have become increasingly adopted in speech processing tasks due to their excellent rate-distortion performance and compatibility with Large Language Models (LLMs) as discrete feature representations for audio generation. While most existing codecs rely on Residual Vector Quantization (RVQ), Finite Scalar Quantization (FSQ) has recently emerged as a compelling alternative that simplifies training and natively supports single codebooks. We introduce NeuCodec, an FSQ-based NAC, and show that FSQ encodes baked-in redundancy that produces an encoding robust to transmission through noisy channels. First, through an encoder distillation experiment, we show that two different encoders can learn to encode identical audio into vastly different code sequences whilst maintaining comparable reconstruction quality with the same quantizer and decoder. Second, we demonstrate that FSQ has vastly superior bit-level perturbation robustness by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
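For readers comparing the two quantizers, the following is a minimal sketch of finite scalar quantization together with the kind of bit-flip channel simulation described above. It is not the NeuCodec implementation; the per-dimension level counts and the flip probability are illustrative assumptions.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization: bound each dimension, then round to a small grid."""
    z = np.tanh(z)                                   # squash each dim to (-1, 1)
    return np.round((z + 1) / 2 * (levels - 1)).astype(int)   # per-dim index in [0, L-1]

def fsq_dequantize(idx, levels):
    return idx / (levels - 1) * 2 - 1                # back to grid points in [-1, 1]

levels = np.array([8, 8, 5, 5])                      # illustrative per-dim level counts
z = np.random.default_rng(0).standard_normal(4)      # one latent frame
codes = fsq_quantize(z, levels)

# Crude noisy-channel simulation: flip each transmitted bit with probability p.
p = 0.01
bits = np.unpackbits(codes.astype(np.uint8))
flips = (np.random.default_rng(1).random(bits.size) < p).astype(np.uint8)
noisy_codes = np.packbits(bits ^ flips).astype(int)
noisy_codes = np.clip(noisy_codes, 0, levels - 1)    # corrupted codes remain decodable
print(codes, noisy_codes, fsq_dequantize(noisy_codes, levels))
```

One plausible intuition, consistent with the abstract's claim, is that a flipped bit perturbs only a single scalar dimension of the latent, rather than redirecting the whole frame to an unrelated codebook entry as in an index-based quantizer.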
Related papers
- UniSRCodec: Unified and Low-Bitrate Single Codebook Codec with Sub-Band Reconstruction [16.235083704438313]
Neural Audio Codecs (NACs) can reduce transmission overhead by performing compact compression and reconstruction. Existing NACs can be divided into two categories: multi-codebook and single-codebook codecs. We propose UniSRCodec, a single-codebook codec that supports high sampling rates, low bandwidth, high fidelity, and a unified design.
arXiv Detail & Related papers (2026-01-06T07:20:05Z)
- CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio [7.093237513313511]
CoDiCodec is a novel audio autoencoder that overcomes these limitations, in part by efficiently encoding global features via summary embeddings. It produces both compressed continuous embeddings at 11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model. Our work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms.
arXiv Detail & Related papers (2025-09-11T20:31:18Z)
- Spectrogram Patch Codec: A 2D Block-Quantized VQ-VAE and HiFi-GAN for Neural Speech Coding [0.0]
We present a neural speech codec that challenges the need for complex residual vector quantization stacks by introducing a simpler, single-stage quantization approach. Our method operates directly on the mel-spectrogram, treating it as 2D data and quantizing non-overlapping 4x4 patches into a single, shared codebook. This patchwise design simplifies the architecture, enables low-latency streaming, and yields a discrete latent grid.
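As a rough illustration of the patchwise design (not the paper's code), quantizing a mel-spectrogram by non-overlapping 4x4 patches against one shared codebook reduces to a nearest-neighbour lookup per patch; the spectrogram and codebook sizes below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 128))           # (mel bins, frames); shape is illustrative
codebook = rng.standard_normal((512, 16))      # 512 entries, each a flattened 4x4 patch

# Split the spectrogram into non-overlapping 4x4 patches, one 16-dim vector each.
patches = mel.reshape(20, 4, 32, 4).transpose(0, 2, 1, 3).reshape(-1, 16)

# Standard VQ assignment: nearest codebook entry per patch.
dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(axis=1)                   # one discrete token per 4x4 patch
print(codes.shape)                             # (640,) tokens for this spectrogram
```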
arXiv Detail & Related papers (2025-09-02T12:14:41Z)
- SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing methods face several challenges in semantic encoding. We propose SecoustiCodec, a cross-modal aligned, low-bitrate, streaming speech codec.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
- HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling [6.313337261965531]
We introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps.
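A back-of-the-envelope check of the reported numbers (not taken from the paper): 0.3 kbps at 24 tokens per second implies 12.5 bits of information per token, i.e. a single codebook on the order of a few thousand entries.

```python
bitrate_bps = 300                        # 0.3 kbps, as reported
tokens_per_second = 24                   # as reported
print(bitrate_bps / tokens_per_second)   # 12.5 bits per token (~2**12 to 2**13 codes)
```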
arXiv Detail & Related papers (2025-07-25T02:44:30Z)
- Towards Generalized Source Tracing for Codec-Based Deepfake Speech [52.68106957822706]
We introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
arXiv Detail & Related papers (2025-06-08T21:36:10Z)
- L3AC: Towards a Lightweight and Lossless Audio Codec [10.903708510237875]
We introduce L3AC, a lightweight neural audio codec that addresses these challenges by leveraging a single quantizer and a highly efficient architecture. L3AC explores streamlined convolutional networks and local Transformer modules, alongside TConv, a novel structure designed to capture acoustic variations across multiple temporal scales.
arXiv Detail & Related papers (2025-04-07T11:34:39Z)
- SNAC: Multi-Scale Neural Audio Codec [1.0753191494611891]
This paper proposes Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions.
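A minimal sketch of the multi-scale idea (not SNAC's actual code): coarser stages quantize a temporally downsampled residual and upsample it back before the next stage; the strides, codebook size, and latent shape are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 64, 8
z = rng.standard_normal((T, D))                       # latent sequence (time, channels)
strides = [4, 2, 1]                                    # coarse -> fine temporal resolution
codebooks = [rng.standard_normal((256, D)) for _ in strides]

residual, recon = z.copy(), np.zeros_like(z)
for stride, cb in zip(strides, codebooks):
    pooled = residual.reshape(T // stride, stride, D).mean(axis=1)   # downsample in time
    idx = ((pooled[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    q = np.repeat(cb[idx], stride, axis=0)             # upsample back to the full frame rate
    recon += q
    residual -= q

print(float(np.mean((z - recon) ** 2)))                # leftover error after all stages
```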
arXiv Detail & Related papers (2024-10-18T12:24:05Z)
- Compression-Realized Deep Structural Network for Video Quality Enhancement [78.13020206633524]
This paper focuses on the task of quality enhancement for compressed videos.
Most of the existing methods lack a structured design to optimally leverage the priors within compression codecs.
A new paradigm is urgently needed for a more "conscious" process of quality enhancement.
arXiv Detail & Related papers (2024-05-10T09:18:17Z)
- Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks [53.550782959908524]
We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks.
Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
arXiv Detail & Related papers (2024-03-19T19:27:23Z)
- Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
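Since the encoder features change minimally across time-steps, they can be recomputed only occasionally and reused in between while the decoder still runs at every step. The sketch below illustrates that caching pattern with hypothetical stand-ins for the UNet halves and a toy update rule; it is not the paper's propagation scheme.

```python
import numpy as np

# Hypothetical stand-ins for the two halves of a denoising UNet (not real models).
def unet_encoder(x, t):
    return np.tanh(x + 0.01 * t)

def unet_decoder(feats, x, t):
    return 0.1 * feats + 0.01 * x + 0.001 * t

def sample(x, num_steps=50, encoder_every=5):
    feats = None
    for step in range(num_steps, 0, -1):
        if feats is None or step % encoder_every == 0:
            feats = unet_encoder(x, step)        # full encoder pass, run only sparsely
        eps = unet_decoder(feats, x, step)       # decoder (which varies more) runs each step
        x = x - eps / num_steps                  # toy update rule, not a real sampler
    return x

print(sample(np.zeros(4)))
```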
arXiv Detail & Related papers (2023-12-15T08:46:43Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- Cross-Scale Vector Quantization for Scalable Neural Speech Coding [22.65761249591267]
Bitrate scalability is a desirable feature for audio coding in real-time communications.
In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ).
In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and progressively improves quality as more bits are available.
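The scalability property can be illustrated with a generic multi-stage quantizer (a toy scalar one here, not CSVQ's cross-scale architecture): decoding only the layers that arrived yields a coarse signal, and each additional layer refines it.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.standard_normal(32)

# Toy multi-stage quantizer: each stage encodes the previous stages' leftover
# on a finer grid, producing one "layer" of the bitstream per stage.
layers, residual = [], target.copy()
for stage in range(4):
    step = 1.0 / 2 ** stage
    contribution = np.round(residual / step) * step
    layers.append(contribution)
    residual -= contribution

# Progressive decoding: reconstruct from however many layers were received.
for k in range(1, 5):
    recon = np.sum(layers[:k], axis=0)
    print(k, "layers -> MSE", round(float(np.mean((target - recon) ** 2)), 5))
```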
arXiv Detail & Related papers (2022-07-07T03:23:25Z)
- Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models.
We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words.
Our framework is game-theoretic, motivated by generative adversarial networks (GANs).
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than that of the naïve MD model on GPU and CPU with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
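For contrast with the FSQ sketch near the top of the page, a minimal residual vector quantization sketch (not SoundStream's implementation; codebook sizes are illustrative): each stage quantizes what the previous stages left over, so one encoder frame becomes a stack of codebook indices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_stages = 8, 4
codebooks = rng.standard_normal((n_stages, 256, D))
z = rng.standard_normal(D)                        # one encoder frame

residual, recon, codes = z.copy(), np.zeros(D), []
for cb in codebooks:
    idx = int(((residual - cb) ** 2).sum(-1).argmin())   # nearest entry at this stage
    codes.append(idx)
    recon += cb[idx]
    residual -= cb[idx]

print(codes, float(np.mean((z - recon) ** 2)))    # 4 indices per frame plus leftover error
```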
arXiv Detail & Related papers (2021-07-07T15:45:42Z)