One Quantizer is Enough: Toward a Lightweight Audio Codec
- URL: http://arxiv.org/abs/2504.04949v1
- Date: Mon, 07 Apr 2025 11:34:39 GMT
- Title: One Quantizer is Enough: Toward a Lightweight Audio Codec
- Authors: Linwei Zhai, Han Ding, Cui Zhao, fei wang, Ge Wang, Wang Zhi, Wei Xi,
- Abstract summary: We present SQCodec, a lightweight neural audio that leverages a single quantizer to address limitations of existing approaches.<n> SQCodec explores streamlined convolutional networks and local Transformer modules, alongside TConv.<n>Experiments show that SQCodec audio quality comparable to multi-quantizer achieves baselines, while its single-quantizer design offers enhanced adaptability.
- Score: 10.903708510237875
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Neural audio codecs have recently gained traction for their ability to compress high-fidelity audio and generate discrete tokens that can be utilized in downstream generative modeling tasks. However, leading approaches often rely on resource-intensive models and multi-quantizer architectures, resulting in considerable computational overhead and constrained real-world applicability. In this paper, we present SQCodec, a lightweight neural audio codec that leverages a single quantizer to address these limitations. SQCodec explores streamlined convolutional networks and local Transformer modules, alongside TConv, a novel mechanism designed to capture acoustic variations across multiple temporal scales, thereby enhancing reconstruction fidelity while reducing model complexity. Extensive experiments across diverse datasets show that SQCodec achieves audio quality comparable to multi-quantizer baselines, while its single-quantizer design offers enhanced adaptability and its lightweight architecture reduces resource consumption by an order of magnitude. The source code is publicly available at https://github.com/zhai-lw/SQCodec.
Related papers
- FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [12.446324804274628]
FocalCodec is an efficient low-bitrate based on focal modulation that utilizes a single binary codebook to compress speech.<n>Demo samples, code and checkpoints are available at https://lucadellalib.io/focalcodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z) - Scaling Transformers for Low-Bitrate High-Quality Speech Coding [15.58137711465863]
We show that it is possible to reach state-of-the-art speech quality at extremely low bit-rates of $400$ or $700$ bits-per-second.
The trained models strongly out-perform existing baselines in both objective and subjective tests.
arXiv Detail & Related papers (2024-11-29T16:58:02Z) - SNAC: Multi-Scale Neural Audio Codec [1.0753191494611891]
Multi-Scale Neural Audio Codec is a simple extension of RVQ where the quantizers can operate at different temporal resolutions.
This paper proposes Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions.
arXiv Detail & Related papers (2024-10-18T12:24:05Z) - Learning Source Disentanglement in Neural Audio Codec [20.335701584949526]
We introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation.<n>By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations.<n> Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space.
arXiv Detail & Related papers (2024-09-17T14:21:02Z) - WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [63.8735398698683]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.<n>We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.<n>WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z) - Gull: A Generative Multifunctional Audio Codec [39.414833825453655]
Gull is a general purpose neural audio compression and decompression model.
Key components include universal-sample-rate modeling, gain-shape representations, residual vector quantization modules, elastic decoder network.
arXiv Detail & Related papers (2024-04-07T12:57:46Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - Vertical Layering of Quantized Neural Networks for Heterogeneous
Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z) - A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken
Command Recognition [69.97260364850001]
We propose a quantum kernel learning (QKL) framework to address the inherent data sparsity issues.
We project acoustic features based on classical-to-quantum feature encoding.
arXiv Detail & Related papers (2022-11-02T16:46:23Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity, audio leveraging neural networks.
It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.