Related papers: FlowDec: A flow-based full-band general audio codec with high perceptual quality

URL: http://arxiv.org/abs/2503.01485v1
Date: Mon, 03 Mar 2025 12:49:09 GMT
Title: FlowDec: A flow-based full-band general audio codec with high perceptual quality
Authors: Simon Welker, Matthew Le, Ricky T. Q. Chen, Wei-Ning Hsu, Timo Gerkmann, Alexander Richard, Yi-Chiao Wu
Abstract summary: FlowDec is a neural full-band audio codec for general audio sampled at 48 kHz. We generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s.
Score: 90.05968801459524
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose FlowDec, a neural full-band audio codec for general audio sampled at 48 kHz that combines non-adversarial codec training with a stochastic postfilter based on a novel conditional flow matching method. Compared to the prior work ScoreDec which is based on score matching, we generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s, while improving output quality and reducing the required postfilter DNN evaluations from 60 to 6 without any fine-tuning or distillation techniques. We provide theoretical insights and geometric intuitions for our approach in comparison to ScoreDec as well as another recent work that uses flow matching, and conduct ablation studies on our proposed components. We show that FlowDec is a competitive alternative to the recent GAN-dominated stream of neural codecs, achieving FAD scores better than those of the established GAN-based codec DAC and listening test scores that are on par, and producing qualitatively more natural reconstructions for speech and harmonic structures in music.

Related papers

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation [27.32235541083431]
FocalCodec-Stream is a hybrid codec that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates.
arXiv Detail & Related papers (2025-09-19T17:57:13Z)
Spectrogram Patch Codec: A 2D Block-Quantized VQ-VAE and HiFi-GAN for Neural Speech Coding [0.0]
We present a neural speech codec that challenges the need for complex residual vector quantization stacks by introducing a simpler, single-stage quantization approach. Our method operates directly on the mel-spectrogram, treating it as 2D data and quantizing non-overlapping 4x4 patches into a single, shared codebook. This patchwise design simplifies the architecture, enables low-latency streaming, and yields a discrete latent grid.
arXiv Detail & Related papers (2025-09-02T12:14:41Z)
BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models [62.38713281234756]
A binaural rendering pipeline aims to synthesize audio that mimics natural hearing from mono audio. Many methods have been proposed to solve this problem, but they struggle with rendering quality and streamable inference. We propose a flow matching based streaming speech synthesis framework called BinauralFlow.
arXiv Detail & Related papers (2025-05-28T20:59:15Z)
FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates [10.14555083237668]
FlowMAC is a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC achieves similar quality to state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate.
arXiv Detail & Related papers (2024-09-26T08:32:31Z)
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video. We propose Frieren, a V2A model based on rectified flow matching. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework. It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
A DNN Based Post-Filter to Enhance the Quality of Coded Speech in MDCT Domain [16.70806998451696]
We propose a mask-based post-filter operating directly in the MDCT domain, inducing no extra delay. The real-valued mask is applied to the quantized MDCT coefficients and is estimated from a relatively lightweight convolutional encoder-decoder network. Our solution is tested on the recently standardized low-delay, low-complexity codec (LC3) at the lowest possible bitrate of 16 kbps.
arXiv Detail & Related papers (2022-01-28T11:08:02Z)
SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio. SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder [29.63675159839434]
Flow-based neural vocoders have shown significant improvement in real-time speech generation tasks. We propose audio dequantization methods in a flow-based neural vocoder for high fidelity audio generation.
arXiv Detail & Related papers (2020-08-16T09:37:18Z)
Efficient Adaptation of Neural Network Filter for Video Compression [10.769305738505071]
We present an efficient finetuning methodology for neural-network filters. The fine-tuning is performed at the encoder side to adapt the neural network to the specific content that is being encoded. The proposed method is much faster than conventional finetuning approaches.
arXiv Detail & Related papers (2020-07-28T14:24:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.