Related papers: High Fidelity Neural Audio Compression

High Fidelity Neural Audio Compression

URL: http://arxiv.org/abs/2210.13438v1
Date: Mon, 24 Oct 2022 17:52:02 GMT
Title: High Fidelity Neural Audio Compression
Authors: Alexandre D\'efossez, Jade Copet, Gabriel Synnaeve, Yossi Adi
Abstract summary: We introduce a state-of-the-art real-time, high-fidelity, audio leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary.
Score: 92.4812002532009
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.

Related papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine [16.046905753937384]
This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model.<n>Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low codecs (i.e., less than 200 bps) with a minimal loss of the downstream model performance.
arXiv Detail & Related papers (2025-07-17T00:32:07Z)
Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders [13.82572699087732]
We propose a simple, post-hoc framework to modify the bottleneck of a pre-trained autoencoder.<n>Our method introduces a "Re-Bottleneck", an inner bottleneck trained exclusively through latent space losses to instill user-defined structure.<n>Ultimately, our Re-Bottleneck framework offers a flexible and efficient way to tailor representations of neural audio models.
arXiv Detail & Related papers (2025-07-10T15:47:43Z)
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [65.30937248905958]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain. WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations. These models are prone to generate audible artifacts when the conditioning is flawed or imperfect. We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
Audio Spectral Enhancement: Leveraging Autoencoders for Low Latency Reconstruction of Long, Lossy Audio Sequences [0.0]
We propose a novel approach for reconstructing higher frequencies from considerably longer sequences of low-quality MP3 audio waves. Our architecture presents several bottlenecks while preserving the spectral structure of the audio wave via skip-connections. We show how to leverage differential quantization techniques to reduce the initial model size by more than half while simultaneously reducing inference time.
arXiv Detail & Related papers (2021-08-08T18:06:21Z)
SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio. SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding [30.307627653506756]
We present a psychoacoustic calibration scheme to re-define the loss functions of neural audio coding systems. With the proposed method, a lightweight neural, with only 0.9 million parameters, performs near-transparent audio coding comparable with the commercial MPEG-1 Audio Layer III at 112 kbps.
arXiv Detail & Related papers (2020-12-31T19:46:46Z)
Ultra-low bitrate video conferencing using deep image animation [7.263312285502382]
We propose a novel deep learning approach for ultra-low video compression for video conferencing applications. We employ deep neural networks to encode motion information as keypoint displacement and reconstruct the video signal at the decoder side.
arXiv Detail & Related papers (2020-12-01T09:06:34Z)
Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness. The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
Content Adaptive and Error Propagation Aware Deep Video Compression [110.31693187153084]
We propose a content adaptive and error propagation aware video compression system. Our method employs a joint training strategy by considering the compression performance of multiple consecutive frames instead of a single frame. Instead of using the hand-crafted coding modes in the traditional compression systems, we design an online encoder updating scheme in our system.
arXiv Detail & Related papers (2020-03-25T09:04:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.