SNAC: Multi-Scale Neural Audio Codec
- URL: http://arxiv.org/abs/2410.14411v1
- Date: Fri, 18 Oct 2024 12:24:05 GMT
- Title: SNAC: Multi-Scale Neural Audio Codec
- Authors: Hubert Siuzdak, Florian Grötschla, Luca A. Lanzendörfer,
- Abstract summary: Multi-Scale Neural Audio Codec is a simple extension of RVQ where the quantizers can operate at different temporal resolutions.
This paper proposes Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions.
- Score: 1.0753191494611891
- License:
- Abstract: Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks. This paper proposes the Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions. By applying a hierarchy of quantizers at variable frame rates, the codec adapts to the audio structure across multiple timescales. This leads to more efficient compression, as demonstrated by extensive objective and subjective evaluations. The code and model weights are open-sourced at https://github.com/hubertsiuzdak/snac.
Related papers
- A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [65.05719674893999]
We study two different strategies based on token prediction and regression, and introduce a new method based on Schr"odinger Bridge.
We examine how different design choices affect machine and human perception.
arXiv Detail & Related papers (2024-10-29T18:29:39Z) - Learning Source Disentanglement in Neural Audio Codec [20.335701584949526]
We introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation.
By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations.
Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space.
arXiv Detail & Related papers (2024-09-17T14:21:02Z) - Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [36.61105228468503]
X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization stage.
X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications.
Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.
arXiv Detail & Related papers (2024-08-30T10:24:07Z) - WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [65.30937248905958]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.
We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.
WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z) - SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [40.810505707522324]
SemantiCodec is designed to compress audio into fewer than a hundred tokens per second across diverse audio types.
We show that SemantiCodec significantly outperforms the state-of-the-art Descript on reconstruction quality.
Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs.
arXiv Detail & Related papers (2024-04-30T22:51:36Z) - High-Fidelity Audio Compression with Improved RVQGAN [49.7859037103693]
We introduce a high-fidelity universal neural audio compression algorithm that achieves 90x compression of 44.1 KHz audio into tokens at just 8kbps bandwidth.
We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio.
arXiv Detail & Related papers (2023-06-11T00:13:00Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity, audio leveraging neural networks.
It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AaudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Audio Captioning Transformer [44.68751180694813]
Audio captioning aims to automatically generate a natural language description of an audio clip.
Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder.
We propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free.
arXiv Detail & Related papers (2021-07-21T00:31:50Z) - SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.