FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates
- URL: http://arxiv.org/abs/2409.17635v1
- Date: Thu, 26 Sep 2024 08:32:31 GMT
- Title: FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates
- Authors: Nicola Pia and Martin Strauss and Markus Multrus and Bernd Edler
- Abstract summary: FlowMAC is a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM).
FlowMAC at 3 kbps achieves quality similar to state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate.
- Score: 10.14555083237668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces FlowMAC, a novel neural audio codec for high-quality
general audio compression at low bit rates based on conditional flow matching
(CFM). FlowMAC jointly learns a mel spectrogram encoder, quantizer and decoder.
At inference time the decoder integrates a continuous normalizing flow via an
ODE solver to generate a high-quality mel spectrogram. This is the first time
that a CFM-based approach is applied to general audio coding, enabling
scalable, simple and memory-efficient training. Our subjective evaluations show
that FlowMAC at 3 kbps achieves quality similar to state-of-the-art GAN-based
and DDPM-based neural audio codecs at double the bit rate. Moreover, FlowMAC
offers a tunable inference pipeline, which makes it possible to trade off complexity and
quality. This enables real-time coding on CPU, while maintaining high
perceptual quality.
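The abstract's statement that the decoder "integrates a continuous normalizing flow via an ODE solver", with a tunable trade-off between complexity and quality, can be illustrated with a minimal sketch. This is not the FlowMAC implementation: the vector field below is a hypothetical toy stand-in for the learned, conditioned network, the function names are invented for illustration, and fixed-step Euler is just one possible solver. It only shows how the number of solver steps controls integration accuracy.

```python
import math
from typing import Callable, List

Vector = List[float]


def euler_integrate(vector_field: Callable[[Vector, float], Vector],
                    x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed-step Euler."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        v = vector_field(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_toy_field(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned vector field conditioned on `target`.

    In a real CFM decoder this would be a neural network; here it is the
    linear field v(x, t) = target - x, whose exact solution is known.
    """
    def field(x: Vector, t: float) -> Vector:
        return [ci - xi for ci, xi in zip(target, x)]
    return field


target = [1.0, -2.0, 0.5]   # toy "conditioning" values (assumption)
x0 = [0.0, 0.0, 0.0]        # starting point; a real decoder would start from noise
field = make_toy_field(target)

# Closed-form solution of dx/dt = target - x at t = 1, used as the reference.
exact = [c + (a - c) * math.exp(-1.0) for a, c in zip(x0, target)]


def max_error(num_steps: int) -> float:
    approx = euler_integrate(field, x0, num_steps)
    return max(abs(a - e) for a, e in zip(approx, exact))


# More solver steps cost more vector-field evaluations but track the ODE more closely.
errors = {n: max_error(n) for n in (1, 4, 16, 64)}
for n, err in errors.items():
    print(f"steps={n:3d}  max abs error={err:.4f}")
```

In this toy setting each Euler step costs one evaluation of the vector field, so the step count is the knob that trades compute for accuracy; how step count maps to perceptual quality in FlowMAC itself is reported in the paper, not by this sketch.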
Related papers
- A Quantum Approximate Optimization Algorithm-based Decoder Architecture for NextG Wireless Channel Codes [6.52154420965995]
Forward Error Correction (FEC) provides reliable data flow in wireless networks despite the presence of noise and interference.
FEC processing demands a significant fraction of a wireless network's resources, due to its computationally expensive decoding process.
We present FDeQ, a QAOA-based FEC Decoder design targeting the popular NextG wireless Low Density Parity Check (LDPC) and Polar codes.
FDeQ achieves successful decoding with error performance on par with state-of-the-art classical decoders at low FEC code block lengths.
arXiv Detail & Related papers (2024-08-21T15:53:09Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- Denoising Diffusion Error Correction Codes [92.10654749898927]
Recently, neural decoders have demonstrated their advantage over classical decoding techniques.
Recent state-of-the-art neural decoders suffer from high complexity and lack the important iterative scheme characteristic of many legacy decoders.
We propose to employ denoising diffusion models for the soft decoding of linear codes at arbitrary block lengths.
arXiv Detail & Related papers (2022-09-16T11:00:50Z)
- Masked Autoencoders that Listen [79.99280830830854]
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram.
arXiv Detail & Related papers (2022-07-13T17:59:55Z)
- Cross-Scale Vector Quantization for Scalable Neural Speech Coding [22.65761249591267]
Bitrate scalability is a desirable feature for audio coding in real-time communications.
In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ).
In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and quality progressively improves as more bits become available.
arXiv Detail & Related papers (2022-07-07T03:23:25Z)
- Improved decoding of circuit noise and fragile boundaries of tailored surface codes [61.411482146110984]
We introduce decoders that are both fast and accurate, and can be used with a wide class of quantum error correction codes.
Our decoders, named belief-matching and belief-find, exploit all noise information and thereby unlock higher accuracy demonstrations of QEC.
We find that the decoders led to a much higher threshold and lower qubit overhead in the tailored surface code with respect to the standard, square surface code.
arXiv Detail & Related papers (2022-03-09T18:48:54Z)
- A DNN Based Post-Filter to Enhance the Quality of Coded Speech in MDCT Domain [16.70806998451696]
We propose a mask-based post-filter operating directly in MDCT domain, inducing no extra delay.
The real-valued mask is applied to the quantized MDCT coefficients and is estimated from a relatively lightweight convolutional encoder-decoder network.
Our solution is tested on the recently standardized low-delay, low-complexity codec (LC3) at its lowest possible bitrate of 16 kbps.
arXiv Detail & Related papers (2022-01-28T11:08:02Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than that of the naive MD model on GPU and CPU with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate [8.312162364318235]
We present a GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s.
The proposed model is a modified version of the StyleMelGAN vocoder that can run in a frame-by-frame manner.
arXiv Detail & Related papers (2021-08-09T14:03:07Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
- Enhancement Of Coded Speech Using a Mask-Based Post-Filter [9.324642081509754]
A data-driven post-filter relying on masking in the time-frequency domain is proposed.
A fully connected neural network (FCNN), a convolutional encoder-decoder (CED) network and a long short-term memory (LSTM) network are implemented to estimate a real-valued mask per time-frequency bin.
arXiv Detail & Related papers (2020-10-12T09:48:09Z)