FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
- URL: http://arxiv.org/abs/2509.16195v1
- Date: Fri, 19 Sep 2025 17:57:13 GMT
- Title: FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
- Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli
- Abstract summary: FocalCodec-Stream is a hybrid codec that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates.
- Score: 27.32235541083431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
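The quoted figures can be sanity-checked with simple arithmetic: a single codebook of 2^k entries emits k bits per token, so bitrate is just bits-per-token times token rate. A minimal sketch, where the token rates and the 13-bit codebook size are illustrative assumptions rather than values reported in the paper:

```python
import math

def bitrate_kbps(codebook_size: int, tokens_per_second: float) -> float:
    """Bitrate (kbps) of a single token stream drawn from one codebook."""
    bits_per_token = math.log2(codebook_size)  # e.g. 2**13 entries -> 13 bits
    return bits_per_token * tokens_per_second / 1000.0

# A hypothetical 13-bit binary codebook at ~42-62 tokens/s spans roughly
# the quoted 0.55 - 0.80 kbps range.
print(round(bitrate_kbps(2**13, 42.4), 2))  # -> 0.55
print(round(bitrate_kbps(2**13, 61.6), 2))  # -> 0.8
```

This also illustrates why a single low-rate codebook is attractive for downstream language models: fewer tokens per second of audio means shorter sequences to model.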
Related papers
- Voxtral Realtime [134.66962524291424]
Voxtral Realtime is a streaming automatic speech recognition model. It matches offline transcription quality at sub-second latency. We release the model weights under the Apache 2.0 license.
arXiv Detail & Related papers (2026-02-11T19:17:10Z)
- CoD: A Diffusion Foundation Model for Image Compression [57.572664625372106]
Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. CoD can be trained from scratch to enable end-to-end optimization of both compression and generation.
arXiv Detail & Related papers (2025-11-24T03:00:15Z)
- Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates [1.445167946386569]
We show that Finite Scalar Quantization (FSQ) bakes redundancy into its encoding, producing codes that remain robust when transmitted through noisy channels. We demonstrate that FSQ has vastly superior robustness to bit-level perturbation by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
arXiv Detail & Related papers (2025-09-11T15:39:59Z)
- NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference [19.201753265782685]
Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens. Existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. We introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS).
arXiv Detail & Related papers (2025-08-07T20:20:32Z)
- SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing methods face several challenges in semantic encoding. We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
- Towards Generalized Source Tracing for Codec-Based Deepfake Speech [52.68106957822706]
We introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
arXiv Detail & Related papers (2025-06-08T21:36:10Z)
- MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation [19.998635838159217]
MagiCodec is a novel single-layer, streaming Transformer-based audio codec. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components. We show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks.
arXiv Detail & Related papers (2025-05-31T04:31:02Z)
- FlowDec: A flow-based full-band general audio codec with high perceptual quality [90.05968801459524]
FlowDec is a neural full-band audio codec for general audio sampled at 48 kHz. We generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s.
arXiv Detail & Related papers (2025-03-03T12:49:09Z)
- FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [12.446324804274628]
FocalCodec is an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech. Demo samples, code, and checkpoints are available at https://lucadellalib.io/focalcodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music, and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
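Two quantization schemes recur in the list above: the residual vector quantizer (RVQ) that SoundStream trains jointly with its encoder/decoder, and finite scalar quantization (FSQ) from the transmission-robustness paper. A minimal sketch of both, with random codebooks for illustration only (a real codec learns them end-to-end):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: a cascade of codebooks where each
    stage quantizes the residual left by the previous one. Returns the
    per-stage code indices and the running reconstruction."""
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:                      # cb: (num_codes, dim)
        residual = x - recon
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        recon = recon + cb[idx]
    return indices, recon

def fsq_round(x, levels=5):
    """Finite scalar quantization: bound each dimension with tanh, then
    round to a small integer grid. Every grid point is a valid code, so
    bit-level perturbations map to nearby points rather than garbage."""
    return np.round(np.tanh(x) * (levels // 2))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # 4 stages, 16 codes each
x = rng.normal(size=8)
indices, recon = rvq_encode(x, codebooks)   # 4 indices, one per stage
codes = fsq_round(x)                        # per-dimension values in {-2, ..., 2}
```

The contrast matters for the robustness result above: RVQ indices are arbitrary labels into a learned table, so a flipped bit selects an unrelated codeword, whereas an FSQ grid is ordered, so nearby codes decode to nearby values.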
This list is automatically generated from the titles and abstracts of the papers in this site.