SNAC: Multi-Scale Neural Audio Codec
- URL: http://arxiv.org/abs/2410.14411v1
- Date: Fri, 18 Oct 2024 12:24:05 GMT
- Title: SNAC: Multi-Scale Neural Audio Codec
- Authors: Hubert Siuzdak, Florian Grötschla, Luca A. Lanzendörfer,
- Abstract summary: Multi-Scale Neural Audio Codec is a simple extension of RVQ where the quantizers can operate at different temporal resolutions.
- Score: 1.0753191494611891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks. This paper proposes the Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions. By applying a hierarchy of quantizers at variable frame rates, the codec adapts to the audio structure across multiple timescales. This leads to more efficient compression, as demonstrated by extensive objective and subjective evaluations. The code and model weights are open-sourced at https://github.com/hubertsiuzdak/snac.
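The idea in the abstract, a cascade of residual quantizers where each level may run at its own temporal resolution, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the random codebooks, the average-pool downsampling, the repeat-based upsampling, and the names `nearest_code`, `multiscale_rvq`, and `strides` are all assumptions made for illustration.

```python
import numpy as np


def nearest_code(x, codebook):
    """Return, for each row of x, the index of the closest codebook vector."""
    # Squared Euclidean distance between every frame and every code.
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


def multiscale_rvq(latents, codebooks, strides):
    """Quantize `latents` (frames x dim) with one codebook per level.

    Level i sees the current residual average-pooled by strides[i],
    so coarse levels emit fewer tokens than fine ones.
    Assumes the number of frames is divisible by every stride.
    """
    residual = latents.copy()
    reconstruction = np.zeros_like(latents)
    tokens = []
    for codebook, stride in zip(codebooks, strides):
        frames, dim = residual.shape
        # Downsample in time: average each group of `stride` frames.
        pooled = residual.reshape(frames // stride, stride, dim).mean(axis=1)
        idx = nearest_code(pooled, codebook)
        # Upsample back to the full frame rate by repeating each code.
        quantized = np.repeat(codebook[idx], stride, axis=0)
        reconstruction += quantized
        residual -= quantized  # the next level quantizes what is left
        tokens.append(idx)
    return tokens, reconstruction


rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 4))                        # 16 frames, 4 dims
codebooks = [rng.normal(size=(32, 4)) for _ in range(3)]  # untrained, random
strides = [4, 2, 1]                                       # coarse to fine

tokens, recon = multiscale_rvq(latents, codebooks, strides)
print([len(t) for t in tokens])  # [4, 8, 16]
```

With all strides set to 1 the function reduces to plain RVQ, where every level emits one token per frame. With strides of 4, 2, and 1, the same 16 frames cost 28 tokens instead of 48, which is the kind of saving the multi-scale design targets.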
Related papers
- How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection [60.88800374832363]
Recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.
arXiv Detail & Related papers (2026-02-18T10:29:07Z)
- Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates [1.445167946386569]
We show that Finite Scalar Quantization (FSQ) encodes baked-in redundancy, producing an encoding that is robust when transmitted through noisy channels. We demonstrate that FSQ has vastly superior robustness to bit-level perturbation by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
arXiv Detail & Related papers (2025-09-11T15:39:59Z)
- NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference [19.201753265782685]
Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens. Existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. We introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS).
arXiv Detail & Related papers (2025-08-07T20:20:32Z)
- One Quantizer is Enough: Toward a Lightweight Audio Codec [10.903708510237875]
We present SQCodec, a lightweight neural audio codec that leverages a single quantizer to address limitations of existing approaches.
SQCodec explores streamlined convolutional networks and local Transformer modules, alongside TConv.
Experiments show that SQCodec achieves audio quality comparable to multi-quantizer baselines, while its single-quantizer design offers enhanced adaptability.
arXiv Detail & Related papers (2025-04-07T11:34:39Z)
- FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [12.446324804274628]
FocalCodec is an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech.
Demo samples, code and checkpoints are available at https://lucadellalib.io/focalcodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z)
- A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [65.05719674893999]
We study two different strategies based on token prediction and regression, and introduce a new method based on Schrödinger Bridge.
We examine how different design choices affect machine and human perception.
arXiv Detail & Related papers (2024-10-29T18:29:39Z)
- Learning Source Disentanglement in Neural Audio Codec [20.335701584949526]
We introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation.
By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations.
Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space.
arXiv Detail & Related papers (2024-09-17T14:21:02Z)
- Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [36.61105228468503]
X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization stage.
X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications.
Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.
arXiv Detail & Related papers (2024-08-30T10:24:07Z)
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [65.30937248905958]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.
We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.
WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
- SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [40.810505707522324]
SemantiCodec is designed to compress audio into fewer than a hundred tokens per second across diverse audio types.
We show that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality.
Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs.
arXiv Detail & Related papers (2024-04-30T22:51:36Z)
- High-Fidelity Audio Compression with Improved RVQGAN [49.7859037103693]
We introduce a high-fidelity universal neural audio compression algorithm that achieves 90x compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth.
We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio.
arXiv Detail & Related papers (2023-06-11T00:13:00Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art, real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Audio Captioning Transformer [44.68751180694813]
Audio captioning aims to automatically generate a natural language description of an audio clip.
Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder.
We propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free.
arXiv Detail & Related papers (2021-07-21T00:31:50Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.