Related papers: NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference

NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference

URL: http://arxiv.org/abs/2508.05835v1
Date: Thu, 07 Aug 2025 20:20:32 GMT
Title: NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference
Authors: Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukić, Jason Li, Boris Ginsburg,
Abstract summary: Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens.<n>Existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models.<n>We introduce NanoCodec, a state-of-the-art audio that achieves high-quality compression at just 12.5 frames per second (FPS)
Score: 19.201753265782685
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new benchmark for low-latency and efficient Speech LLM training and inference.

Related papers

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation [27.32235541083431]
FocalCodec-Stream is a hybrid that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms.<n> Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparables.
arXiv Detail & Related papers (2025-09-19T17:57:13Z)
SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models.<n>Existing methods face several challenges in semantic encoding.<n>We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codecs.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling [6.313337261965531]
We introduce HH-Codec, a neural codecs that achieves extreme compression at 24 tokens per second for 24 kHz audio.<n>Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss.<n> HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps.
arXiv Detail & Related papers (2025-07-25T02:44:30Z)
Towards Generalized Source Tracing for Codec-Based Deepfake Speech [52.68106957822706]
We introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding.<n>Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
arXiv Detail & Related papers (2025-06-08T21:36:10Z)
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [12.446324804274628]
FocalCodec is an efficient low-bitrate based on focal modulation that utilizes a single binary codebook to compress speech.<n>Demo samples, code and checkpoints are available at https://lucadellalib.io/focalcodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z)
Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference [10.909997817643905]
We present the Low Frame-rate Speech Codec (LFSC): a neural audio that leverages a finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps and 21.5 frames per second. We demonstrate that our novel LLM can make the inference of text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
arXiv Detail & Related papers (2024-09-18T16:39:10Z)
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound [40.810505707522324]
SemantiCodec is designed to compress audio into fewer than a hundred tokens per second across diverse audio types.<n>We show that SemantiCodec significantly outperforms the state-of-the-art Descript on reconstruction quality.<n>Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated state-of-the-art audio codecs.
arXiv Detail & Related papers (2024-04-30T22:51:36Z)
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [55.95078490630001]
This paper presents FunCodec, a fundamental neural speech toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech models, such as SoundStream and Encodec. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes.
arXiv Detail & Related papers (2023-09-14T03:18:24Z)
High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity, audio leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency. Experiments show that our model achieves $19.76times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio. SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.