Related papers: Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

URL: http://arxiv.org/abs/2601.23174v2
Date: Wed, 04 Feb 2026 18:42:12 GMT
Title: Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization
Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli,
Abstract summary: We introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer.<n>DyCAST learns to associate tokens with character-level linguistic units during training.<n>We also introduce a retrieval-augmented decoding mechanism that enhances reconstruction without increasing fidelity.
Score: 27.32235541083431
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.

Related papers

CoPE-VideoLM: Codec Primitives For Efficient Video Language Models [56.76440182038839]
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos.<n>Current methods use sampling which can miss both macro-level events and micro-level details due to the sparse temporal coverage.<n>We propose to leverage video primitives which encode video redundancy and sparsity without requiring expensive full-image encoding for most frames.
arXiv Detail & Related papers (2026-02-13T18:57:31Z)
NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference [19.201753265782685]
Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens.<n>Existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models.<n>We introduce NanoCodec, a state-of-the-art audio that achieves high-quality compression at just 12.5 frames per second (FPS)
arXiv Detail & Related papers (2025-08-07T20:20:32Z)
READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework.<n>Our approach first learns highly compressed video latent space via a VAE, significantly reducing the token count to speech generation.<n>We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models.<n>Existing methods face several challenges in semantic encoding.<n>We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codecs.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate [14.03590336780589]
We propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time.<n>TFC enables seamlessly average frame rates and dynamically allocates frame rates based on temporal entropy.<n> Experimental results show that a neural reconstruction with TFC achieves optimal quality with high flexibility, and maintains competitive performance even at lower frame rates.
arXiv Detail & Related papers (2025-05-22T16:10:01Z)
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.<n>We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [33.022035588157614]
FocalCodec is an efficient low-bitrate based on focal modulation that utilizes a single binary codebook to compress speech.<n>Demo samples and code are available at https://lucadellalib.io/kbpscodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z)
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [36.61105228468503]
X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization stage.<n>X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications.<n>Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.
arXiv Detail & Related papers (2024-08-30T10:24:07Z)
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system. This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X) NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders [9.998721582869438]
PSCodec is a series of neural speech codecs based on prompt encoders. PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN are capable of delivering high-performance speech reconstruction with low bandwidths.
arXiv Detail & Related papers (2024-04-03T13:00:08Z)
RepCodec: A Speech Representation Codec for Speech Tokenization [21.60885344868044]
RepCodec is a novel representation for semantic speech tokenization. We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z)
Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once) The model consists of an encoder, a decoder, and a position dependent summarizer (PDS)
arXiv Detail & Related papers (2021-02-15T15:18:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.