EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding
- URL: http://arxiv.org/abs/2601.17517v2
- Date: Tue, 27 Jan 2026 21:36:05 GMT
- Title: EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding
- Authors: Luca Cerovaz, Michele Mancusi, Emanuele Rodolà
- Abstract summary: Most frequency-domain neural codecs disregard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. This entails introducing adversarial discriminators at the expense of convergence speed and training stability. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline.
- Score: 18.199202388702144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio codecs power discrete music generative modelling, music streaming and immersive media by shrinking PCM audio to bandwidth-friendly bit-rates. Recent works have gravitated towards processing in the spectral domain; however, spectrogram-domain models typically struggle with phase modeling, since the spectrogram is naturally complex-valued. Most frequency-domain neural codecs either disregard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. This entails introducing adversarial discriminators, at the expense of convergence speed and training stability, to compensate for the inadequate representation power of the audio signal. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline and removes adversarial discriminators and diffusion post-filters. Without GANs or diffusion we match or surpass much longer-trained baselines in-domain and reach SOTA out-of-domain performance. Compared to standard baselines that train for hundreds of thousands of steps, our model reduces the training budget by an order of magnitude, making it markedly more compute-efficient while preserving high perceptual quality.
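The abstract's central point is that a complex STFT bin carries magnitude and phase jointly, while the common real-valued encoding splits it into two independent channels. A minimal numpy sketch (a toy illustration, not the paper's model) shows the coupling a single complex weight preserves; all signal parameters here are arbitrary:

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)        # 1 s, 440 Hz tone

# One windowed frame -> complex spectrum (1024-pt real FFT, 513 bins)
frame = x[:1024] * np.hanning(1024)
spec = np.fft.rfft(frame)                 # complex-valued bins

# Complex view: one number per bin, magnitude and phase coupled
mag, phase = np.abs(spec), np.angle(spec)
assert np.allclose(mag * np.exp(1j * phase), spec)

# Real-valued view used by many codecs: two separate channels
re, im = spec.real, spec.imag

# A complex-valued weight scales magnitude and rotates phase together;
# two real channels processed independently do not preserve this coupling.
w = 0.5 * np.exp(1j * np.pi / 4)          # one complex weight
y = w * spec
assert np.allclose(np.abs(y), 0.5 * mag)  # magnitude scaled by |w|
nz = mag > 1e-8                           # skip numerically zero bins
assert np.allclose(np.angle(y[nz] * np.conj(spec[nz])), np.pi / 4)
```

The last assertion checks that every non-zero bin's phase is rotated by exactly arg(w), which is the magnitude-phase coupling the codec aims to keep end-to-end.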
Related papers
- Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z) - SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection [6.042897432654865]
Spectral-cONtrastive Audio Residuals (SONAR) is a frequency-guided framework for deepfake audio detectors. SONAR disentangles an audio signal into complementary representations. It is evaluated on the ASVspoof 2021 and in-the-wild benchmarks.
arXiv Detail & Related papers (2025-11-26T12:16:38Z) - Learning to Upsample and Upmix Audio in the Latent Domain [14.777092647088756]
Neural audio autoencoders create compact latent representations that preserve perceptually important information. We propose a framework that performs audio processing operations entirely within an autoencoder's latent space. We demonstrate computational efficiency gains of up to 100x while maintaining quality comparable to post-processing on raw audio.
arXiv Detail & Related papers (2025-05-31T19:27:22Z) - DDT: Decoupled Diffusion Transformer [51.84206763079382]
Diffusion transformers encode noisy inputs to extract the semantic component and decode higher frequencies with identical modules. We propose the Decoupled Diffusion Transformer (DDT).
arXiv Detail & Related papers (2025-04-08T07:17:45Z) - Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos. We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video. We propose Frieren, a V2A model based on rectified flow matching. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation [39.45425155123186]
We develop a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis.
We propose a loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm.
Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
arXiv Detail & Related papers (2023-09-05T19:19:22Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a framework can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
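Several of the codecs above (the RVQ-VAE of the main paper, EnCodec-style compression) share residual vector quantization as their discretization step. A minimal sketch of the mechanism follows; the codebooks here are random stand-ins rather than trained ones, and all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_quantizers = 8, 64, 4
# Hypothetical untrained codebooks: n_quantizers stages, each with
# codebook_size codewords of dimension dim.
codebooks = rng.normal(size=(n_quantizers, codebook_size, dim))

def rvq_encode(z, codebooks):
    """Quantize z stage by stage: each codebook encodes the residual
    left over by the sum of all previous stages."""
    quantized = np.zeros_like(z)
    residual = z.copy()
    codes = []
    for cb in codebooks:
        # nearest codeword to the current residual
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=-1)))
        codes.append(idx)
        quantized += cb[idx]        # accumulate the selected codeword
        residual = z - quantized    # what later stages must still explain
    return codes, quantized

z = rng.normal(size=dim)
codes, z_hat = rvq_encode(z, codebooks)
# codes: one integer index per stage (the bitstream the codec transmits);
# z_hat: the decoder-side reconstruction, the sum of the chosen codewords.
```

The transmitted bit-rate is `n_quantizers * log2(codebook_size)` bits per latent vector, which is why dropping trailing codebooks gives these codecs their variable-bit-rate behavior.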
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.