StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
- URL: http://arxiv.org/abs/2509.22220v1
- Date: Fri, 26 Sep 2025 11:32:51 GMT
- Title: StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
- Authors: Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou,
- Abstract summary: Speech tokenizers are not robust to meaning-irrelevant acoustic perturbations. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal. We introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism.
- Score: 54.229363096087866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
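The two mechanisms named in the abstract can be made concrete with a small sketch: a per-bit majority vote that merges parallel branch codes into one token, and a Unit Edit Distance computed as Levenshtein distance normalized by reference length. Both are illustrative assumptions, not the authors' code; in particular, the paper's exact UED normalization is not given here.

```python
def bitwise_vote(branch_codes, num_bits):
    """Merge per-branch integer codes into one code by strict majority vote
    on each bit (ties resolve to 0). Illustrative sketch only."""
    merged = 0
    for b in range(num_bits):
        ones = sum((code >> b) & 1 for code in branch_codes)
        if ones * 2 > len(branch_codes):  # majority of branches set this bit
            merged |= 1 << b
    return merged

def unit_edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences, normalized by the
    reference length (an assumed normalization)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # one-row DP over the hypothesis
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)
```

A stable tokenizer is one for which `unit_edit_distance(tokens(clean), tokens(noisy))` stays near zero at high SNR.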
Related papers
- Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes [10.877713536966601]
The Longest Stable Prefix (LSP) scheduler is a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. LSP evaluates token stability via a single forward pass and dynamically identifies a contiguous, left-aligned block of stable predictions. It snaps the block's boundary to natural linguistic or structural acceptance points before an atomic commitment.
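A minimal sketch of the prefix-selection step, assuming per-position stability flags are already available from the single forward pass; the boundary-token set and the snapping rule are hypothetical simplifications of the idea in the summary.

```python
def longest_stable_prefix(stable_flags, tokens, boundary_tokens=frozenset({".", ",", "\n"})):
    """Return the number of tokens to commit: the contiguous left-aligned run
    of stable predictions, snapped back to the last boundary token if any.
    Hypothetical simplification, not the paper's algorithm."""
    k = 0
    while k < len(stable_flags) and stable_flags[k]:
        k += 1  # extend the left-aligned stable run
    # snap the commit boundary back to a natural structural break, if one exists
    for j in range(k, 0, -1):
        if tokens[j - 1] in boundary_tokens:
            return j
    return k
```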
arXiv Detail & Related papers (2026-03-05T18:25:26Z)
- Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference [58.189320101488725]
DLLMs promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependencies. We propose ReMix, a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state.
arXiv Detail & Related papers (2026-02-26T11:08:11Z)
- Noise Stability of Transformer Models [28.608164171197483]
We argue that average sensitivity lacks a natural generalization to real-valued domains. Noise stability expresses a model's robustness to correlated noise applied to coordinates simultaneously. Our results establish a new connection between signal propagation in neural networks and interpretability.
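Noise stability in this sense can be estimated empirically: perturb the input with noise that is correlated across coordinates and average the resulting output deviation. The covariance model (unit pairwise correlation `rho`) and the `model` callable interface below are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def noise_stability(model, x, rho=0.9, sigma=0.1, trials=32, seed=0):
    """Estimate mean output deviation of `model` under Gaussian noise with
    pairwise correlation `rho` across all input coordinates. Sketch only."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    # covariance with variance sigma^2 and pairwise correlation rho (PSD for 0 <= rho < 1)
    cov = sigma**2 * ((1 - rho) * np.eye(d) + rho * np.ones((d, d)))
    base = model(x)
    devs = []
    for _ in range(trials):
        eps = rng.multivariate_normal(np.zeros(d), cov)
        devs.append(np.linalg.norm(model(x + eps) - base))
    return float(np.mean(devs))
```

A model whose output ignores the input is perfectly stable (score 0); larger scores indicate higher sensitivity to correlated perturbations.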
arXiv Detail & Related papers (2026-02-09T05:43:22Z)
- Frontend Token Enhancement for Token-Based Speech Recognition [50.35062963870211]
Discretized representations of speech signals are efficient alternatives to continuous features for speech recognition applications. In this work, we introduce a system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/token domains: wave-to-wave, token-to-output, continuous SSL features-to-token, and wave-to-token.
arXiv Detail & Related papers (2026-02-04T05:02:15Z)
- CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
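The sensitivity probe can be sketched as follows, assuming a hypothetical `predict_prob(seq, i)` interface that returns the model's probability of the token at position `i` given the (possibly perturbed) sequence; CORE's actual perturbation and scoring scheme may differ.

```python
def context_brittleness(predict_prob, tokens, mask_id, window=2):
    """Score each position's sensitivity to masking its surrounding context.
    `predict_prob(seq, i)` is a hypothetical model interface. Higher scores
    mark context-brittle tokens that a remasking scheme would revise first."""
    scores = []
    for i in range(len(tokens)):
        base = predict_prob(tokens, i)
        perturbed = list(tokens)
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                perturbed[j] = mask_id  # mask the local context, keep position i
        scores.append(abs(base - predict_prob(perturbed, i)))
    return scores
```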
arXiv Detail & Related papers (2026-02-04T00:12:30Z)
- DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion [28.204167153140506]
Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models. We propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens.
arXiv Detail & Related papers (2026-01-14T07:22:24Z)
- Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR [37.09163295946173]
We propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise-invariance.
arXiv Detail & Related papers (2025-10-29T04:08:19Z)
- Semantic Fusion with Fuzzy-Membership Features for Controllable Language Modelling [0.0]
Semantic fusion is a lightweight scheme that augments a Transformer language model (LM) with a fuzzy-membership feature channel. Each token is represented by a vector of interpretable features whose values are graded degrees from differentiable membership functions. This approach adds only small overhead, remains fully compatible with tied input-output embeddings, and provides an interpretable pathway for conditioned natural language generation.
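A minimal example of a differentiable membership function that yields graded degrees, here a Gaussian bump around each fuzzy set's center; the paper's actual feature sets and membership shapes are not specified here, so the functional form is an assumption.

```python
import numpy as np

def gaussian_membership(value, centers, width=1.0):
    """Graded membership degrees of a scalar feature in each fuzzy set,
    using Gaussian membership functions (differentiable in `value`).
    Illustrative choice, not the paper's exact functions."""
    centers = np.asarray(centers, dtype=float)
    return np.exp(-((value - centers) ** 2) / (2 * width**2))
```

Per-token vectors of such degrees (one entry per fuzzy set) can then be concatenated to the token embedding as an extra feature channel.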
arXiv Detail & Related papers (2025-09-14T22:11:09Z)
- New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR [30.00166986946003]
We take a new perspective and regard alignment and matching as a detection problem. The goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens. We propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries.
arXiv Detail & Related papers (2025-09-06T05:58:52Z)
- Robust Prompt Tuning for Vision-Language Models with Mild Semantic Noise [9.536089523962486]
We propose ANPrompt, a robust prompt tuning framework that actively incorporates weak semantic noise. We show that ANPrompt consistently outperforms existing prompt tuning methods. It offers superior robustness to semantic noise and improved generalization across tasks.
arXiv Detail & Related papers (2025-08-06T17:42:30Z)
- LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization [8.365515332927444]
Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. We propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. We show that LM-SPT achieves superior reconstruction fidelity compared to baselines.
arXiv Detail & Related papers (2025-06-20T04:15:14Z)
- Sylber: Syllabic Embedding Representation of Speech from Raw Audio [25.703703711031178]
We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised learning framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition. MELLE mitigates robustness issues by avoiding the inherent flaws of sampling vector-quantized codes.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Stable Neighbor Denoising for Source-free Domain Adaptive Segmentation [91.83820250747935]
Pseudo-label noise is mainly contained in unstable samples in which predictions of most pixels undergo significant variations during self-training.
We introduce the Stable Neighbor Denoising (SND) approach, which effectively discovers highly correlated stable and unstable samples.
SND consistently outperforms state-of-the-art methods in various SFUDA semantic segmentation settings.
arXiv Detail & Related papers (2024-06-10T21:44:52Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Weak-Attention Suppression For Transformer Based Speech Recognition [33.30436927415777]
We propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities.
We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines.
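The dynamic sparsification can be sketched with a per-row threshold of the mean minus a scaled standard deviation of the attention probabilities, zeroing entries below it and renormalizing the survivors; the threshold form is an assumption based on this summary, not necessarily the paper's exact rule.

```python
import numpy as np

def weak_attention_suppression(attn, gamma=0.5):
    """Zero out attention probabilities below a dynamic per-row threshold
    (mean - gamma * std), then renormalize the remaining mass. Sketch of
    the WAS idea; the paper's formulation may differ."""
    attn = np.asarray(attn, dtype=float)
    thresh = attn.mean(axis=-1, keepdims=True) - gamma * attn.std(axis=-1, keepdims=True)
    kept = np.where(attn >= thresh, attn, 0.0)  # the row max always survives
    return kept / kept.sum(axis=-1, keepdims=True)
```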
arXiv Detail & Related papers (2020-05-18T23:49:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.