FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
- URL: http://arxiv.org/abs/2509.16195v1
- Date: Fri, 19 Sep 2025 17:57:13 GMT
- Title: FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
- Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli
- Abstract summary: FocalCodec-Stream is a hybrid codec that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates.
- Score: 27.32235541083431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
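The quoted figures can be sanity-checked with simple arithmetic: a single codebook of 2^k entries emits k bits per token, so bitrate is just bits-per-token times token rate. A minimal sketch, where the token rates and the 13-bit codebook size are illustrative assumptions rather than values reported in the paper:

```python
import math

def bitrate_kbps(codebook_size: int, tokens_per_second: float) -> float:
    """Bitrate (kbps) of a single token stream drawn from one codebook."""
    bits_per_token = math.log2(codebook_size)  # e.g. 2**13 entries -> 13 bits
    return bits_per_token * tokens_per_second / 1000.0

# A hypothetical 13-bit binary codebook at ~42-62 tokens/s spans roughly
# the quoted 0.55 - 0.80 kbps range.
print(round(bitrate_kbps(2**13, 42.4), 2))  # -> 0.55
print(round(bitrate_kbps(2**13, 61.6), 2))  # -> 0.8
```

This also illustrates why a single low-rate codebook is attractive for downstream language models: fewer tokens per second of audio means shorter sequences to model.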
Related papers
- Voxtral Realtime [134.66962524291424]
Voxtral Realtime is a streaming automatic speech recognition model. It matches offline transcription quality at sub-second latency. We release the model weights under the Apache 2.0 license.
arXiv Detail & Related papers (2026-02-11T19:17:10Z)
- CoD: A Diffusion Foundation Model for Image Compression [57.572664625372106]
Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. CoD can be trained from scratch to enable end-to-end optimization of both compression and generation.
arXiv Detail & Related papers (2025-11-24T03:00:15Z)
- Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates [1.445167946386569]
We show that Finite Scalar Quantization (FSQ) bakes redundancy into its encoding, producing codes that remain robust when transmitted through noisy channels. We demonstrate that FSQ has vastly superior robustness to bit-level perturbation by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
arXiv Detail & Related papers (2025-09-11T15:39:59Z)
- NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference [19.201753265782685]
Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens. Existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. We introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS).
arXiv Detail & Related papers (2025-08-07T20:20:32Z)
- SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing methods face several challenges in semantic encoding. We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
- Towards Generalized Source Tracing for Codec-Based Deepfake Speech [52.68106957822706]
We introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
arXiv Detail & Related papers (2025-06-08T21:36:10Z)
- MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation [19.998635838159217]
MagiCodec is a novel single-layer, streaming Transformer-based audio codec. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components. We show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks.
arXiv Detail & Related papers (2025-05-31T04:31:02Z)
- FlowDec: A flow-based full-band general audio codec with high perceptual quality [90.05968801459524]
FlowDec is a neural full-band audio codec for general audio sampled at 48 kHz. We generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s.
arXiv Detail & Related papers (2025-03-03T12:49:09Z)
- FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [12.446324804274628]
FocalCodec is an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech. Demo samples, code, and checkpoints are available at https://lucadellalib.io/focalcodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music, and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
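Two quantization schemes recur in the list above: the residual vector quantizer (RVQ) that SoundStream trains jointly with its encoder/decoder, and finite scalar quantization (FSQ) from the transmission-robustness paper. A minimal sketch of both, with random codebooks for illustration only (a real codec learns them end-to-end):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: a cascade of codebooks where each
    stage quantizes the residual left by the previous one. Returns the
    per-stage code indices and the running reconstruction."""
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:                      # cb: (num_codes, dim)
        residual = x - recon
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        recon = recon + cb[idx]
    return indices, recon

def fsq_round(x, levels=5):
    """Finite scalar quantization: bound each dimension with tanh, then
    round to a small integer grid. Every grid point is a valid code, so
    bit-level perturbations map to nearby points rather than garbage."""
    return np.round(np.tanh(x) * (levels // 2))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # 4 stages, 16 codes each
x = rng.normal(size=8)
indices, recon = rvq_encode(x, codebooks)   # 4 indices, one per stage
codes = fsq_round(x)                        # per-dimension values in {-2, ..., 2}
```

The contrast matters for the robustness result above: RVQ indices are arbitrary labels into a learned table, so a flipped bit selects an unrelated codeword, whereas an FSQ grid is ordered, so nearby codes decode to nearby values.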
This list is automatically generated from the titles and abstracts of the papers in this site.