FlowDec: A flow-based full-band general audio codec with high perceptual quality
- URL: http://arxiv.org/abs/2503.01485v1
- Date: Mon, 03 Mar 2025 12:49:09 GMT
- Title: FlowDec: A flow-based full-band general audio codec with high perceptual quality
- Authors: Simon Welker, Matthew Le, Ricky T. Q. Chen, Wei-Ning Hsu, Timo Gerkmann, Alexander Richard, Yi-Chiao Wu
- Abstract summary: FlowDec is a neural full-band audio codec for general audio sampled at 48 kHz. Compared to the prior work ScoreDec, it generalizes from speech to general audio and moves from 24 kbit/s to as low as 4 kbit/s.
- Score: 90.05968801459524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose FlowDec, a neural full-band audio codec for general audio sampled at 48 kHz that combines non-adversarial codec training with a stochastic postfilter based on a novel conditional flow matching method. Compared to the prior work ScoreDec which is based on score matching, we generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s, while improving output quality and reducing the required postfilter DNN evaluations from 60 to 6 without any fine-tuning or distillation techniques. We provide theoretical insights and geometric intuitions for our approach in comparison to ScoreDec as well as another recent work that uses flow matching, and conduct ablation studies on our proposed components. We show that FlowDec is a competitive alternative to the recent GAN-dominated stream of neural codecs, achieving FAD scores better than those of the established GAN-based codec DAC and listening test scores that are on par, and producing qualitatively more natural reconstructions for speech and harmonic structures in music.
Related papers
- FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates [10.14555083237668]
FlowMAC is a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM).
FlowMAC achieves quality similar to that of state-of-the-art GAN-based and DDPM-based neural audio codecs operating at double the bit rate.
arXiv Detail & Related papers (2024-09-26T08:32:31Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video. We propose Frieren, a V2A model based on rectified flow matching. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art, real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z) - A DNN Based Post-Filter to Enhance the Quality of Coded Speech in MDCT Domain [16.70806998451696]
We propose a mask-based post-filter operating directly in the MDCT domain, inducing no extra delay.
The real-valued mask is applied to the quantized MDCT coefficients and is estimated from a relatively lightweight convolutional encoder-decoder network.
Our solution is tested on the recently standardized low-delay, low-complexity codec (LC3) at the lowest possible bitrate of 16 kbps.
arXiv Detail & Related papers (2022-01-28T11:08:02Z) - SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z) - Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder [29.63675159839434]
Flow-based neural vocoders have shown significant improvement in the real-time speech generation task.
We propose audio dequantization methods for flow-based neural vocoders to enable high-fidelity audio generation.
arXiv Detail & Related papers (2020-08-16T09:37:18Z) - Efficient Adaptation of Neural Network Filter for Video Compression [10.769305738505071]
We present an efficient finetuning methodology for neural-network filters.
The fine-tuning is performed at the encoder side to adapt the neural network to the specific content that is being encoded.
The proposed method is much faster than conventional finetuning approaches.
arXiv Detail & Related papers (2020-07-28T14:24:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.