TokenChain: A Discrete Speech Chain via Semantic Token Modeling
- URL: http://arxiv.org/abs/2510.06201v1
- Date: Tue, 07 Oct 2025 17:54:12 GMT
- Title: TokenChain: A Discrete Speech Chain via Semantic Token Modeling
- Authors: Mingxuan Wang, Satoshi Nakamura
- Abstract summary: TokenChain is a discrete speech chain coupling semantic-token ASR with a two-stage TTS. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax. Evaluation reveals TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with stable T2S on LibriSpeech.
- Score: 28.053602247858674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced with supervised ASR via dynamic weight averaging. Ablations examine optimal temperature schedules for in- and cross-domain transfer. Evaluation reveals TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with stable T2S on LibriSpeech, and reduces relative ASR WER by 56% and T2S WER by 31% on TED-LIUM with minimal forgetting, showing that chain learning remains effective with token interfaces and models.
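The two mechanisms named in the abstract, the straight-through argmax/Gumbel-Softmax text interface and dynamic weight averaging (DWA), can be sketched compactly. The snippet below is a minimal illustration assuming a PyTorch setup; the function names, tensor shapes, and the exact DWA formulation follow the standard recipes for these techniques and are not taken from the authors' released code.

```python
import torch
import torch.nn.functional as F

def straight_through_text_interface(asr_logits, tau=1.0, hard=True, use_gumbel=True):
    """Turn ASR posteriors over text tokens into (near) one-hot text for the TTS input
    while keeping a gradient path back into the ASR model.

    asr_logits: [batch, seq_len, vocab] unnormalized ASR output scores.
    """
    if use_gumbel:
        # Gumbel-Softmax relaxation; hard=True applies the straight-through trick
        # (one-hot values in the forward pass, soft gradients in the backward pass).
        return F.gumbel_softmax(asr_logits, tau=tau, hard=hard, dim=-1)
    # Straight-through argmax: the forward pass uses the argmax one-hot,
    # the backward pass uses the softmax gradient.
    soft = F.softmax(asr_logits / tau, dim=-1)
    index = soft.argmax(dim=-1, keepdim=True)
    hard_onehot = torch.zeros_like(soft).scatter_(-1, index, 1.0)
    return hard_onehot + (soft - soft.detach())

def dwa_weights(prev_losses, curr_losses, T=2.0):
    """Dynamic weight averaging over task losses (e.g. supervised ASR vs. chain feedback).

    prev_losses / curr_losses: per-task scalar losses from the two most recent epochs.
    Returns per-task weights that sum to the number of tasks.
    """
    ratios = torch.tensor([c / max(p, 1e-8) for c, p in zip(curr_losses, prev_losses)])
    return len(ratios) * F.softmax(ratios / T, dim=0)
```

In a full speech chain, the output of the text interface would be multiplied with the TTS text-embedding matrix so the T2S loss backpropagates into the ASR encoder, with tau following whatever temperature schedule the ablations select.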
Related papers
- Frontend Token Enhancement for Token-Based Speech Recognition [50.35062963870211]
Discretized representations of speech signals are efficient alternatives to continuous features for speech recognition applications. In this work, we introduce a system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/token domains: wave-to-wave, token-to-output, continuous SSL features-to-token, and wave-to-token.
arXiv Detail & Related papers (2026-02-04T05:02:15Z)
- Entropy-based Coarse and Compressed Semantic Speech Representation Learning [72.18542411704347]
We propose an entropy-based dynamic aggregation framework for learning compressed semantic speech representations. Experiments on ASR, speech-to-text translation, and voice conversion tasks demonstrate that the compressed representations perform on par with or better than dense token sequences.
arXiv Detail & Related papers (2025-08-30T13:50:58Z)
- A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data [46.73430446242378]
We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. We demonstrate the effectiveness of the framework on Taiwanese Mandarin speech.
arXiv Detail & Related papers (2025-06-10T17:30:32Z)
- FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching [56.30231216917128]
FELLE is an autoregressive model that integrates language modeling with token-wise flow matching. For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step. FELLE generates continuous-valued tokens hierarchically, conditioned on the language model's output.
arXiv Detail & Related papers (2025-02-16T13:54:32Z)
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
- Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction [14.661123738628772]
We introduce a text-to-speech (TTS) framework based on a neural transducer.
We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework and to benefit from its monotonic alignment constraints (a sketch of this kind of token extraction appears after this list).
arXiv Detail & Related papers (2023-11-06T06:13:39Z)
- Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing [21.049557187137776]
We propose an improved consistency training paradigm for semi-supervised S2S ASR.
We utilize speech chain reconstruction as the weak augmentation to generate high-quality pseudo labels.
Our improved paradigm achieves a 12.2% CER improvement in the single-speaker setting and 38.6% in the multi-speaker setting.
arXiv Detail & Related papers (2022-05-14T04:26:13Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation [11.79922306758482]
Machine Speech Chain integrates end-to-end automatic speech recognition (ASR) and text-to-speech (TTS) into a single loop for joint training.
We explore the TTS->ASR pipeline in the speech chain to perform domain adaptation for both neural TTS and E2E ASR models.
arXiv Detail & Related papers (2021-04-08T14:52:37Z)
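Several entries above, TokenChain itself and the two neural-transducer TTS papers, build on semantic tokens obtained by discretizing self-supervised speech features. A common recipe, assumed here rather than taken from any of these papers, is k-means clustering over intermediate wav2vec 2.0 hidden states; the layer index and codebook size below are illustrative.

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

# Pretrained wav2vec 2.0 from torchaudio; layer choice and codebook size are assumptions.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

def extract_features(waveform, sample_rate, layer=6):
    """Return frame-level hidden states [frames, dim] from an intermediate transformer layer.

    waveform: mono audio shaped [1, time], e.g. as returned by torchaudio.load.
    """
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
    with torch.inference_mode():
        layer_outputs, _ = model.extract_features(waveform, num_layers=layer)
    return layer_outputs[-1].squeeze(0)

def fit_codebook(feature_matrix, num_tokens=500):
    """Fit the semantic-token vocabulary with k-means over pooled training features."""
    km = MiniBatchKMeans(n_clusters=num_tokens, batch_size=10_000)
    km.fit(feature_matrix.numpy())
    return km

def to_semantic_tokens(km, feats):
    """Map each frame to its nearest cluster index, i.e. one discrete semantic token."""
    return torch.as_tensor(km.predict(feats.numpy()))
```

The resulting frame-level indices are what a semantic-token ASR predicts from speech and what a text-to-semantic model generates before the semantic-to-acoustic stage.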
This list is automatically generated from the titles and abstracts of the papers on this site.