Related papers: MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation

URL: http://arxiv.org/abs/2506.00385v1
Date: Sat, 31 May 2025 04:31:02 GMT
Title: MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation
Authors: Yakun Song, Jiawei Chen, Xiaobin Zhuang, Chenpeng Du, Ziyang Ma, Jian Wu, Jian Cong, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Abstract summary: MagiCodec is a novel single-layer, streaming Transformer-based audio codec. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components. We show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks.
Score: 19.998635838159217
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottleneck, we introduce $\textbf{MagiCodec}$, a novel single-layer, streaming Transformer-based audio codec. MagiCodec is designed with a multistage training pipeline that incorporates Gaussian noise injection and latent regularization, explicitly targeting the enhancement of semantic expressiveness in the generated codes while preserving high reconstruction fidelity. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components and fostering robust tokenization. Extensive experimental evaluations show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks. Notably, the tokens produced by MagiCodec exhibit Zipf-like distributions, as observed in natural languages, thereby improving compatibility with language-model-based generative architectures. The code and pre-trained models are available at https://github.com/Ereboas/MagiCodec.
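The abstract's claim that codec tokens follow a Zipf-like distribution can be checked informally with a rank-frequency fit. The following is a minimal sketch and not the authors' evaluation code: the function name `zipf_slope` and the synthetic `toy_tokens` sequence are hypothetical stand-ins for token ids produced by a codec. A slope near -1 in log-log space would be consistent with Zipf-like behaviour, while a slope near 0 indicates a flat, uniform usage of the codebook.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    # Token frequencies sorted from most to least common (rank 1, 2, ...).
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical data: token k appears roughly 1000 / k times (an idealised Zipf law).
toy_tokens = [k for k in range(1, 51) for _ in range(round(1000 / k))]
print(zipf_slope(toy_tokens))
```

In practice the token sequence would come from encoding a held-out audio corpus, and the fit is only a rough diagnostic, since the tail of a rank-frequency curve is noisy.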

Related papers

Latent-Mark: An Audio Watermark Robust to Neural Resynthesis [62.09761127079914]
Latent-Mark is the first zero-bit audio watermarking framework designed to survive semantic compression. Our key insight is that robustness to the encode-decode process requires embedding the watermark within the invariant latent space. Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions.
arXiv Detail & Related papers (2026-03-05T15:51:09Z)
Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z)
FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation [27.32235541083431]
FocalCodec-Stream is a hybrid codec that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates.
arXiv Detail & Related papers (2025-09-19T17:57:13Z)
Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates [1.445167946386569]
We show that Finite Scalar Quantization (FSQ) encodes baked-in redundancy, producing an encoding that is robust when transmitted through noisy channels. We demonstrate that FSQ has vastly superior robustness to bit-level perturbation by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
arXiv Detail & Related papers (2025-09-11T15:39:59Z)
NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference [19.201753265782685]
Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens. Existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. We introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS).
arXiv Detail & Related papers (2025-08-07T20:20:32Z)
SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing methods face several challenges in semantic encoding. We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling [6.313337261965531]
We introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps.
arXiv Detail & Related papers (2025-07-25T02:44:30Z)
Towards Generalized Source Tracing for Codec-Based Deepfake Speech [52.68106957822706]
We introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
arXiv Detail & Related papers (2025-06-08T21:36:10Z)
One Quantizer is Enough: Toward a Lightweight Audio Codec [10.903708510237875]
We present SQCodec, a lightweight neural audio codec that leverages a single quantizer to address limitations of existing approaches. SQCodec explores streamlined convolutional networks and local Transformer modules, alongside TConv. Experiments show that SQCodec achieves audio quality comparable to multi-quantizer baselines, while its single-quantizer design offers enhanced adaptability.
arXiv Detail & Related papers (2025-04-07T11:34:39Z)
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [12.446324804274628]
FocalCodec is an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech. Demo samples, code and checkpoints are available at https://lucadellalib.io/focalcodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z)
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [36.61105228468503]
X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization stage. X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.
arXiv Detail & Related papers (2024-08-30T10:24:07Z)
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [63.8735398698683]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain. WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. We propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
Compression-Realized Deep Structural Network for Video Quality Enhancement [78.13020206633524]
This paper focuses on the task of quality enhancement for compressed videos. Most of the existing methods lack a structured design to optimally leverage the priors within compression codecs. A new paradigm is urgently needed for a more "conscious" process of quality enhancement.
arXiv Detail & Related papers (2024-05-10T09:18:17Z)
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [40.810505707522324]
SemantiCodec is designed to compress audio into fewer than a hundred tokens per second across diverse audio types. We show that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated state-of-the-art audio codecs.
arXiv Detail & Related papers (2024-04-30T22:51:36Z)
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [55.95078490630001]
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes.
arXiv Detail & Related papers (2023-09-14T03:18:24Z)
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations. These models are prone to generate audible artifacts when the conditioning is flawed or imperfect. We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.