CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space
- URL: http://arxiv.org/abs/2603.02022v2
- Date: Tue, 03 Mar 2026 06:49:12 GMT
- Title: CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space
- Authors: Bowen Zhang, Junchuan Zhao, Ian McLoughlin, Ye Wang, A S Madhukumar
- Abstract summary: Speech Bandwidth Extension improves clarity and intelligibility by restoring/inferring appropriate high-frequency content for low-bandwidth speech. Existing methods often rely on spectrogram or waveform modeling, which can incur higher computational cost and have limited high-frequency fidelity. We present CodecFlow, a neural codec-based BWE framework that performs efficient speech reconstruction in a compact latent space.
- Score: 13.286622421661313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech Bandwidth Extension improves clarity and intelligibility by restoring/inferring appropriate high-frequency content for low-bandwidth speech. Existing methods often rely on spectrogram or waveform modeling, which can incur higher computational cost and have limited high-frequency fidelity. Neural audio codecs offer compact latent representations that better preserve acoustic detail, yet accurately recovering high-resolution latent information remains challenging due to representation mismatch. We present CodecFlow, a neural codec-based BWE framework that performs efficient speech reconstruction in a compact latent space. CodecFlow employs a voicing-aware conditional flow converter on continuous codec embeddings and a structure-constrained residual vector quantizer to improve latent alignment stability. Optimized end-to-end, CodecFlow achieves strong spectral fidelity and enhanced perceptual quality on 8 kHz to 16 kHz and 44.1 kHz speech BWE tasks.
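The conditional flow matching idea in the abstract can be made concrete: a converter network is regressed onto the constant velocity of a straight-line path between the narrowband latent and the wideband-target latent. The sketch below assumes linear (rectified-flow-style) paths; the names `cfm_training_pair`, `z_lo`, and `z_hi` are illustrative, not CodecFlow's actual API.

```python
import numpy as np

def cfm_training_pair(z_lo, z_hi, t):
    """Build one conditional-flow-matching training example.

    A point z_t is sampled on the straight path between the
    narrowband latent z_lo and the wideband latent z_hi; the
    regression target is the path's constant velocity.
    (Illustrative sketch, not CodecFlow's actual API.)
    """
    z_t = (1.0 - t) * z_lo + t * z_hi   # interpolant at time t
    v_target = z_hi - z_lo              # velocity the converter must predict
    return z_t, v_target

rng = np.random.default_rng(0)
z_lo = rng.standard_normal((4, 8))      # 4 frames of 8-dim codec embeddings
z_hi = rng.standard_normal((4, 8))      # matching wideband-target latents
t = rng.uniform(size=(4, 1))            # one time sample per frame

z_t, v_target = cfm_training_pair(z_lo, z_hi, t)
# a converter v_theta(z_t, t, voicing_cond) would minimize mean squared
# error against v_target; at inference, integrating dz/dt = v_theta
# from t=0 to t=1 maps the narrowband latent toward a wideband one
```

Because the target velocity is constant along each path, training needs no simulation of the flow, which is what keeps this family of methods efficient relative to score-based diffusion in the same latent space.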
Related papers
- Latent-Mark: An Audio Watermark Robust to Neural Resynthesis [62.09761127079914]
Latent-Mark is the first zero-bit audio watermarking framework designed to survive semantic compression. Our key insight is that robustness to the encode-decode process requires embedding the watermark within the invariant latent space. Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions.
arXiv Detail & Related papers (2026-03-05T15:51:09Z)
- U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation [71.59514998928833]
U-Codec achieves high-fidelity reconstruction and fast speech generation at an extremely low frame rate of 5 Hz. We apply U-Codec to a large language model (LLM)-based auto-regressive TTS model.
arXiv Detail & Related papers (2025-10-19T05:09:20Z)
- FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation [27.32235541083431]
FocalCodec-Stream is a hybrid codec that compresses speech into a single binary codebook at 0.55-0.80 kbps with a theoretical latency of 80 ms. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates.
arXiv Detail & Related papers (2025-09-19T17:57:13Z)
- CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio [7.093237513313511]
CoDiCodec is a novel audio autoencoder that overcomes prior limitations by efficiently encoding global features via summary embeddings. It produces both compressed continuous embeddings at 11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model. Our work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms.
arXiv Detail & Related papers (2025-09-11T20:31:18Z)
- SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing methods face several challenges in semantic encoding. We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
- HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling [6.313337261965531]
We introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio. Our approach involves a carefully designed vector quantization space for spoken language modeling, optimizing compression efficiency while minimizing information loss. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps.
arXiv Detail & Related papers (2025-07-25T02:44:30Z)
- Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate [14.03590336780589]
We propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly adjustable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal quality with high flexibility and maintains competitive performance even at lower frame rates.
arXiv Detail & Related papers (2025-05-22T16:10:01Z)
- Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos. We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z)
- Compression-Realized Deep Structural Network for Video Quality Enhancement [78.13020206633524]
This paper focuses on the task of quality enhancement for compressed videos.
Most of the existing methods lack a structured design to optimally leverage the priors within compression codecs.
A new paradigm is urgently needed for a more "conscious" process of quality enhancement.
arXiv Detail & Related papers (2024-05-10T09:18:17Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- Latent-Domain Predictive Neural Speech Coding [33.458968443594415]
This paper introduces latent-domain predictive coding into the VQ-VAE framework. We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner. Subjective results on speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z)
- Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolution and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
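The residual vector quantizer that SoundStream introduces (and that several codecs above, including CodecFlow's structure-constrained variant, build on) has a compact greedy form: each stage quantizes whatever residual the previous stages left behind, so later codebooks refine earlier ones. The following is a minimal numpy sketch, not SoundStream's actual implementation; the toy codebooks are chosen so the input is exactly representable.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization (illustrative sketch).

    x: (N, D) array of vectors to quantize.
    codebooks: list of (K, D) arrays, one per quantizer stage.
    Returns the per-stage code indices and the final reconstruction.
    """
    quantized = np.zeros_like(x)
    residual = x.copy()
    codes = []
    for cb in codebooks:
        # squared distance from each residual vector to each codeword
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = d.argmin(axis=1)         # nearest codeword per vector
        codes.append(idx)
        quantized += cb[idx]           # accumulate this stage's codewords
        residual = x - quantized       # the next stage quantizes what's left
    return np.stack(codes), quantized

# toy example: two 2-D vectors, two one-stage-each codebooks that
# together represent the input exactly
x = np.array([[1.0, 0.0], [0.0, 2.0]])
codebooks = [np.array([[1.0, 0.0], [0.0, 0.0]]),
             np.array([[0.0, 2.0], [0.0, 0.0]])]
codes, q = rvq_encode(x, codebooks)
```

A practical consequence of this cascade: dropping the last stages at decode time yields a coarser but still valid reconstruction, which is how RVQ-based codecs trade bitrate for quality without retraining.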
This list is automatically generated from the titles and abstracts of the papers on this site.