Related papers: CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio

URL: http://arxiv.org/abs/2509.09836v1
Date: Thu, 11 Sep 2025 20:31:18 GMT
Title: CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio
Authors: Marco Pasini, Stefan Lattner, George Fazekas
Abstract summary: CoDiCodec is a novel audio autoencoder that efficiently encodes global features via summary embeddings. It produces both compressed continuous embeddings at ~11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model. Our work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms.
Score: 7.093237513313511
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Efficiently representing audio signals in a compressed latent space is critical for latent generative modelling. However, existing autoencoders often force a choice between continuous embeddings and discrete tokens. Furthermore, achieving high compression ratios while maintaining audio fidelity remains a challenge. We introduce CoDiCodec, a novel audio autoencoder that overcomes these limitations both by efficiently encoding global features via summary embeddings and by producing compressed continuous embeddings at ~11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model, offering unprecedented flexibility for different downstream generative tasks. This is achieved through Finite Scalar Quantization (FSQ) and a novel FSQ-dropout technique, and does not require additional loss terms beyond the single consistency loss used for end-to-end training. CoDiCodec supports both autoregressive decoding and a novel parallel decoding strategy, with the latter achieving superior audio quality and faster decoding. CoDiCodec outperforms existing continuous and discrete autoencoders at similar bitrates in terms of reconstruction audio quality. Our work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms.
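
As a rough illustration of the Finite Scalar Quantization step named in the abstract, the sketch below bounds each latent coordinate with tanh, rounds it to a small integer grid, and packs the result into a single token. It is a minimal sketch of generic FSQ, not CoDiCodec's implementation: the function names, the level counts, and the odd-levels-only restriction are all assumptions made here for illustration, and FSQ-dropout is not modelled.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each coordinate of z to one of levels[i] integer values.

    Assumption: every entry of `levels` is odd, so the grid is symmetric
    around zero and no half-step offset is needed.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2          # largest representable magnitude
        bounded = math.tanh(value) * half  # squash into [-half, half]
        codes.append(int(round(bounded)))  # snap to the nearest integer level
    return codes

def codes_to_token(codes, levels):
    """Pack per-dimension codes into one integer token (mixed radix)."""
    token = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        token = token * n_levels + (code + half)  # shift code to 0..n_levels-1
    return token

def bits_per_token(levels):
    """Bits needed to index the implicit codebook of size prod(levels)."""
    return math.log2(math.prod(levels))

# Hypothetical configuration: three dimensions with five levels each.
levels = [5, 5, 5]
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token = codes_to_token(codes, levels)
```

Under these assumptions, a discrete bitrate is the token rate multiplied by `bits_per_token(levels)`. The level counts used here are placeholders and are not meant to reproduce the 2.38 kbps figure reported above.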

Related papers

Latent-Mark: An Audio Watermark Robust to Neural Resynthesis [62.09761127079914]
Latent-Mark is the first zero-bit audio watermarking framework designed to survive semantic compression. Our key insight is that robustness to the encode-decode process requires embedding the watermark within the invariant latent space. Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions.
arXiv Detail & Related papers (2026-03-05T15:51:09Z)
UniSRCodec: Unified and Low-Bitrate Single Codebook Codec with Sub-Band Reconstruction [16.235083704438313]
Neural Audio Codecs (NACs) can reduce transmission overhead by performing compact compression and reconstruction. Existing NACs can be divided into two categories: multi-codebook and single-codebook codecs. We propose UniSRCodec, a unified single-codebook codec capable of supporting high sampling rates, low bandwidth, and high fidelity.
arXiv Detail & Related papers (2026-01-06T07:20:05Z)
Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates [1.445167946386569]
We show that Finite Scalar Quantization (FSQ) encodes baked-in redundancy, producing an encoding that is robust when transmitted through noisy channels. We demonstrate that FSQ has vastly superior bit-level perturbation robustness by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
arXiv Detail & Related papers (2025-09-11T15:39:59Z)
SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing methods face several challenges in semantic encoding. We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling [6.313337261965531]
We introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps.
arXiv Detail & Related papers (2025-07-25T02:44:30Z)
Music2Latent2: Audio Compression with Summary Embeddings and Autoregressive Decoding [0.0]
We introduce Music2Latent2, a novel audio autoencoder. Unlike conventional methods that encode local audio features into ordered sequences, Music2Latent2 compresses audio signals into sets of summary embeddings. To handle arbitrary audio lengths, Music2Latent2 employs an autoregressive consistency model trained on two consecutive audio chunks with causal masking.
arXiv Detail & Related papers (2025-01-29T11:34:19Z)
Epsilon-VAE: Denoising as Visual Decoding [61.29255979767292]
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image. By adopting iterative reconstruction through diffusion, our autoencoder, namely Epsilon-VAE, achieves high reconstruction quality.
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
Music2Latent: Consistency Autoencoders for Latent Audio Compression [0.0]
We introduce Music2Latent, an audio autoencoder that overcomes limitations by leveraging consistency models. Music2Latent encodes samples into a compressed continuous latent space in a single end-to-end training process. We demonstrate that Music2Latent outperforms existing continuous audio autoencoders in sound quality and reconstruction accuracy.
arXiv Detail & Related papers (2024-08-12T21:25:19Z)
Compression-Realized Deep Structural Network for Video Quality Enhancement [78.13020206633524]
This paper focuses on the task of quality enhancement for compressed videos. Most of the existing methods lack a structured design to optimally leverage the priors within compression codecs. A new paradigm is urgently needed for a more "conscious" process of quality enhancement.
arXiv Detail & Related papers (2024-05-10T09:18:17Z)
High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art, real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio. SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.