Gull: A Generative Multifunctional Audio Codec
- URL: http://arxiv.org/abs/2404.04947v2
- Date: Fri, 7 Jun 2024 07:03:30 GMT
- Title: Gull: A Generative Multifunctional Audio Codec
- Authors: Yi Luo, Jianwei Yu, Hangting Chen, Rongzhi Gu, Chao Weng
- Abstract summary: Gull is a general purpose neural audio compression and decompression model.
Key components include universal-sample-rate modeling, gain-shape representations, residual vector quantization modules, and an elastic decoder network.
- Score: 39.414833825453655
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce Gull, a generative multifunctional audio codec. Gull is a general purpose neural audio compression and decompression model which can be applied to a wide range of tasks and applications such as real-time communication, audio super-resolution, and codec language models. The key components of Gull include (1) universal-sample-rate modeling via subband modeling schemes motivated by recent progress in audio source separation, (2) gain-shape representations motivated by traditional audio codecs, (3) improved residual vector quantization modules, (4) elastic decoder network that enables user-defined model size and complexity during inference time, (5) built-in ability for audio super-resolution without the increase of bitrate. We compare Gull with existing traditional and neural audio codecs and show that Gull is able to achieve on par or better performance across various sample rates, bitrates and model complexities in both subjective and objective evaluation metrics.
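The abstract names gain-shape representations and residual vector quantization (RVQ) without describing them. As a rough illustration only, the sketch below shows the textbook form of both ideas in plain Python. It is not Gull's implementation (the paper's improved RVQ modules are not described on this page), and the function names and toy codebooks are assumptions made up for the example.

```python
import math


def gain_shape(vec):
    """Split a vector into a scalar gain (its L2 norm) and a unit-norm shape."""
    gain = math.sqrt(sum(x * x for x in vec))
    if gain == 0.0:
        return 0.0, list(vec)  # all-zero input: shape is left as zeros
    return gain, [x / gain for x in vec]


def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage codes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected codewords; fewer indices give a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical toy codebooks: a coarse first stage and a finer second stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0], [0.0, -0.5], [0.5, -0.5]],
]

indices = rvq_encode([1.4, 0.6], CODEBOOKS)
print(indices, rvq_decode(indices, CODEBOOKS))
```

Decoding with only the first index (`rvq_decode(indices[:1], CODEBOOKS)`) still yields a valid but coarser reconstruction, which is why RVQ-based codecs can trade bitrate for quality by dropping later stages.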
Related papers
- One Quantizer is Enough: Toward a Lightweight Audio Codec [10.903708510237875]
We present SQCodec, a lightweight neural audio codec that leverages a single quantizer to address limitations of existing approaches.
SQCodec explores streamlined convolutional networks and local Transformer modules, alongside TConv.
Experiments show that SQCodec achieves audio quality comparable to multi-quantizer baselines, while its single-quantizer design offers enhanced adaptability.
arXiv Detail & Related papers (2025-04-07T11:34:39Z)
- FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [12.446324804274628]
FocalCodec is an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech.
Demo samples, code and checkpoints are available at https://lucadellalib.io/focalcodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z)
- SNAC: Multi-Scale Neural Audio Codec [1.0753191494611891]
This paper proposes Multi-Scale Neural Audio Codec (SNAC), a simple extension of RVQ in which the quantizers can operate at different temporal resolutions.
arXiv Detail & Related papers (2024-10-18T12:24:05Z)
- Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference [10.909997817643905]
We present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression at a bitrate of 1.89 kbps and 21.5 frames per second.
We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
arXiv Detail & Related papers (2024-09-18T16:39:10Z)
- SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [40.810505707522324]
SemantiCodec is designed to compress audio into fewer than a hundred tokens per second across diverse audio types.
We show that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality.
Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs.
arXiv Detail & Related papers (2024-04-30T22:51:36Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers [78.85346970193518]
Megabyte is a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes.
Experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling.
Results establish the viability of tokenization-free autoregressive sequence modeling at scale.
arXiv Detail & Related papers (2023-05-12T00:55:41Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate [8.312162364318235]
We present a GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s.
The proposed model is a modified version of the StyleMelGAN vocoder that can run in a frame-by-frame manner.
arXiv Detail & Related papers (2021-08-09T14:03:07Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
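Several entries above quote a bitrate together with a frame rate, and the two are linked by simple arithmetic: bitrate equals frames per second times bits per frame. The sketch below works this out for the figures quoted in the LFSC entry (1.89 kbps at 21.5 frames per second, which comes to roughly 88 bits per frame). The helper names and the 8-codebook, 1024-entry configuration are hypothetical and are not taken from any paper listed here.

```python
import math


def bits_per_frame(bitrate_bps, frames_per_second):
    """Bits available for each frame at a given bitrate and frame rate."""
    return bitrate_bps / frames_per_second


def bitrate_bps(frames_per_second, num_codebooks, codebook_size):
    """Bitrate of a codec that emits one index per codebook per frame."""
    return frames_per_second * num_codebooks * math.log2(codebook_size)


# Figures quoted in the LFSC entry: 1.89 kbps at 21.5 frames per second.
print(round(bits_per_frame(1890, 21.5), 1))

# Hypothetical configuration: 50 frames/s, 8 codebooks of 1024 entries each.
print(bitrate_bps(50, 8, 1024))
```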
This list is automatically generated from the titles and abstracts of the papers in this site.