StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
- URL: http://arxiv.org/abs/2509.22220v1
- Date: Fri, 26 Sep 2025 11:32:51 GMT
- Title: StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
- Authors: Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou,
- Abstract summary: Speech tokenizers are not robust to meaning-irrelevant acoustic perturbations. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal. We introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism.
- Score: 54.229363096087866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
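The two mechanisms named in the abstract can be made concrete with a small sketch: a per-bit majority vote that merges parallel branch codes into one token, and a Unit Edit Distance computed as Levenshtein distance normalized by reference length. Both are illustrative assumptions, not the authors' code; in particular, the paper's exact UED normalization is not given here.

```python
def bitwise_vote(branch_codes, num_bits):
    """Merge per-branch integer codes into one code by strict majority vote
    on each bit (ties resolve to 0). Illustrative sketch only."""
    merged = 0
    for b in range(num_bits):
        ones = sum((code >> b) & 1 for code in branch_codes)
        if ones * 2 > len(branch_codes):  # majority of branches set this bit
            merged |= 1 << b
    return merged

def unit_edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences, normalized by the
    reference length (an assumed normalization)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # one-row DP over the hypothesis
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)
```

A stable tokenizer is one for which `unit_edit_distance(tokens(clean), tokens(noisy))` stays near zero at high SNR.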
Related papers
- Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes [10.877713536966601]
The Longest Stable Prefix (LSP) scheduler is a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. LSP evaluates token stability via a single forward pass and dynamically identifies a contiguous, left-aligned block of stable predictions. It snaps the block's boundary to natural linguistic or structural acceptance points before an atomic commitment.
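A minimal sketch of the prefix-selection step, assuming per-position stability flags are already available from the single forward pass; the boundary-token set and the snapping rule are hypothetical simplifications of the idea in the summary.

```python
def longest_stable_prefix(stable_flags, tokens, boundary_tokens=frozenset({".", ",", "\n"})):
    """Return the number of tokens to commit: the contiguous left-aligned run
    of stable predictions, snapped back to the last boundary token if any.
    Hypothetical simplification, not the paper's algorithm."""
    k = 0
    while k < len(stable_flags) and stable_flags[k]:
        k += 1  # extend the left-aligned stable run
    # snap the commit boundary back to a natural structural break, if one exists
    for j in range(k, 0, -1):
        if tokens[j - 1] in boundary_tokens:
            return j
    return k
```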
arXiv Detail & Related papers (2026-03-05T18:25:26Z)
- Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference [58.189320101488725]
DLLMs promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependencies. We propose ReMix, a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state.
arXiv Detail & Related papers (2026-02-26T11:08:11Z)
- Noise Stability of Transformer Models [28.608164171197483]
We argue that average sensitivity lacks a natural generalization to real-valued domains. Noise stability expresses a model's robustness to correlated noise applied to coordinates simultaneously. Our results establish a new connection between signal propagation in neural networks and interpretability.
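Noise stability in this sense can be estimated empirically: perturb the input with noise that is correlated across coordinates and average the resulting output deviation. The covariance model (unit pairwise correlation `rho`) and the `model` callable interface below are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def noise_stability(model, x, rho=0.9, sigma=0.1, trials=32, seed=0):
    """Estimate mean output deviation of `model` under Gaussian noise with
    pairwise correlation `rho` across all input coordinates. Sketch only."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    # covariance with variance sigma^2 and pairwise correlation rho (PSD for 0 <= rho < 1)
    cov = sigma**2 * ((1 - rho) * np.eye(d) + rho * np.ones((d, d)))
    base = model(x)
    devs = []
    for _ in range(trials):
        eps = rng.multivariate_normal(np.zeros(d), cov)
        devs.append(np.linalg.norm(model(x + eps) - base))
    return float(np.mean(devs))
```

A model whose output ignores the input is perfectly stable (score 0); larger scores indicate higher sensitivity to correlated perturbations.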
arXiv Detail & Related papers (2026-02-09T05:43:22Z)
- Frontend Token Enhancement for Token-Based Speech Recognition [50.35062963870211]
Discretized representations of speech signals are efficient alternatives to continuous features for speech recognition applications. In this work, we introduce a system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/token domains: wave-to-wave, token-to-output, continuous SSL features-to-token, and wave-to-token.
arXiv Detail & Related papers (2026-02-04T05:02:15Z)
- CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
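The sensitivity probe can be sketched as follows, assuming a hypothetical `predict_prob(seq, i)` interface that returns the model's probability of the token at position `i` given the (possibly perturbed) sequence; CORE's actual perturbation and scoring scheme may differ.

```python
def context_brittleness(predict_prob, tokens, mask_id, window=2):
    """Score each position's sensitivity to masking its surrounding context.
    `predict_prob(seq, i)` is a hypothetical model interface. Higher scores
    mark context-brittle tokens that a remasking scheme would revise first."""
    scores = []
    for i in range(len(tokens)):
        base = predict_prob(tokens, i)
        perturbed = list(tokens)
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                perturbed[j] = mask_id  # mask the local context, keep position i
        scores.append(abs(base - predict_prob(perturbed, i)))
    return scores
```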
arXiv Detail & Related papers (2026-02-04T00:12:30Z)
- DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion [28.204167153140506]
Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models. We propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens.
arXiv Detail & Related papers (2026-01-14T07:22:24Z)
- Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR [37.09163295946173]
We propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise-invariance.
arXiv Detail & Related papers (2025-10-29T04:08:19Z)
- Semantic Fusion with Fuzzy-Membership Features for Controllable Language Modelling [0.0]
Semantic fusion is a lightweight scheme that augments a Transformer language model (LM) with a fuzzy-membership feature channel. Each token is represented by a vector of interpretable features whose values are graded degrees from differentiable membership functions. This approach adds only small overhead, remains fully compatible with tied input-output embeddings, and provides an interpretable pathway for conditioned natural language generation.
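A minimal example of a differentiable membership function that yields graded degrees, here a Gaussian bump around each fuzzy set's center; the paper's actual feature sets and membership shapes are not specified here, so the functional form is an assumption.

```python
import numpy as np

def gaussian_membership(value, centers, width=1.0):
    """Graded membership degrees of a scalar feature in each fuzzy set,
    using Gaussian membership functions (differentiable in `value`).
    Illustrative choice, not the paper's exact functions."""
    centers = np.asarray(centers, dtype=float)
    return np.exp(-((value - centers) ** 2) / (2 * width**2))
```

Per-token vectors of such degrees (one entry per fuzzy set) can then be concatenated to the token embedding as an extra feature channel.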
arXiv Detail & Related papers (2025-09-14T22:11:09Z)
- New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR [30.00166986946003]
We take a new perspective and regard alignment and matching as a detection problem. The goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens. We propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries.
arXiv Detail & Related papers (2025-09-06T05:58:52Z)
- Robust Prompt Tuning for Vision-Language Models with Mild Semantic Noise [9.536089523962486]
We propose ANPrompt, a robust prompt tuning framework that actively incorporates weak semantic noise. We show that ANPrompt consistently outperforms existing prompt tuning methods. It offers superior robustness to semantic noise and improved generalization across tasks.
arXiv Detail & Related papers (2025-08-06T17:42:30Z)
- LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization [8.365515332927444]
Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. We propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. We show that LM-SPT achieves superior reconstruction fidelity compared to baselines.
arXiv Detail & Related papers (2025-06-20T04:15:14Z)
- Sylber: Syllabic Embedding Representation of Speech from Raw Audio [25.703703711031178]
We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised learning framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition. MELLE mitigates robustness issues by avoiding the inherent flaws of sampling vector-quantized codes.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Stable Neighbor Denoising for Source-free Domain Adaptive Segmentation [91.83820250747935]
Pseudo-label noise is mainly contained in unstable samples in which predictions of most pixels undergo significant variations during self-training.
We introduce the Stable Neighbor Denoising (SND) approach, which effectively discovers highly correlated stable and unstable samples.
SND consistently outperforms state-of-the-art methods in various SFUDA semantic segmentation settings.
arXiv Detail & Related papers (2024-06-10T21:44:52Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Weak-Attention Suppression For Transformer Based Speech Recognition [33.30436927415777]
We propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities.
We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines.
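The dynamic sparsification can be sketched with a per-row threshold of the mean minus a scaled standard deviation of the attention probabilities, zeroing entries below it and renormalizing the survivors; the threshold form is an assumption based on this summary, not necessarily the paper's exact rule.

```python
import numpy as np

def weak_attention_suppression(attn, gamma=0.5):
    """Zero out attention probabilities below a dynamic per-row threshold
    (mean - gamma * std), then renormalize the remaining mass. Sketch of
    the WAS idea; the paper's formulation may differ."""
    attn = np.asarray(attn, dtype=float)
    thresh = attn.mean(axis=-1, keepdims=True) - gamma * attn.std(axis=-1, keepdims=True)
    kept = np.where(attn >= thresh, attn, 0.0)  # the row max always survives
    return kept / kept.sum(axis=-1, keepdims=True)
```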
arXiv Detail & Related papers (2020-05-18T23:49:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.