Related papers: Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models

Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models

URL: http://arxiv.org/abs/2505.17446v2
Date: Sat, 31 May 2025 13:32:13 GMT
Title: Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models
Authors: Shunsuke Kando, Yusuke Miyao, Shinnosuke Takamichi,
Abstract summary: Speech tokenization transforms a speech signal into a sequence of discrete representations.<n>This paper investigates two key aspects of speech tokenization: the segmentation width and the cluster size of discrete units.
Score: 16.1461487947151
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The purpose of speech tokenization is to transform a speech signal into a sequence of discrete representations, serving as the foundation for speech language models (SLMs). While speech tokenization has many options, their effect on the performance of SLMs remains unclear. This paper investigates two key aspects of speech tokenization: the segmentation width and the cluster size of discrete units. First, we segment speech signals into fixed/variable widths and pooled representations. We then train K-means models in multiple cluster sizes. Through the evaluation on zero-shot spoken language understanding benchmarks, we find the positive effect of moderately coarse segmentation and bigger cluster size. Notably, among the best-performing models, the most efficient one achieves a 50% reduction in training data and a 70% decrease in training runtime. Our analysis highlights the importance of combining multiple tokens to enhance fine-grained spoken language understanding.

Related papers

A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification [0.0]
A classical issue with the fusion of many embeddings from audio with text is the large length of the audio sequence compared to the text one.<n>Our method benefits from an existing speech tokenizer trained for Audio Speech Recognition that output long sequences of tokens from a large vocabulary.<n>We show this helps to improve the performances compared to an unimodal model, to a bigger SpeechLM or to integrating audio via a learned representation.
arXiv Detail & Related papers (2025-12-08T14:05:40Z)
Scaling Spoken Language Models with Syllabic Speech Tokenization [17.835120807367677]
Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from SSL speech models.<n>Recent SSL work introduces acoustic tokenization of speech at the syllable level.<n>Syllabic tokens can match or surpass the previous high-frame rate tokens while significantly cutting training and inference costs.
arXiv Detail & Related papers (2025-09-30T17:59:09Z)
Entropy-based Coarse and Compressed Semantic Speech Representation Learning [72.18542411704347]
We propose an entropy-based dynamic aggregation framework for learning compressed semantic speech representations.<n> Experiments on ASR, speech-to-text translation, and voice conversion tasks demonstrate that the compressed representations perform on par with or better than dense token sequences.
arXiv Detail & Related papers (2025-08-30T13:50:58Z)
Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs [59.230858581944425]
Two dominant approaches have emerged for speech processing: discrete tokens and continuous features.<n>We compare self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings.<n>Our findings reveal that continuous features generally outperform discrete tokens in various tasks.
arXiv Detail & Related papers (2025-08-25T10:16:07Z)
Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model.<n>The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
Sylber: Syllabic Embedding Representation of Speech from Raw Audio [25.703703711031178]
We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure.<n>Specifically, we propose a self-supervised learning framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation.<n>This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and SotA inc segmentation and clustering. SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
dMel: Speech Tokenization made Simple [16.679015298503593]
We introduce a novel speech representation (dmel) that discretizes mel-filterbank channels into intensity bins.<n>Our approach demonstrates superior performance in preserving audio content, robustness to out-of-domain data, and offers a training-free, natural, and streamable representation.
arXiv Detail & Related papers (2024-07-22T17:51:53Z)
DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs. The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering. Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework. The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers. The HUC framework, allowing the categorization of the representations into a small number of phoneme-like units, is used to train the model for learning semantically rich speech representations.
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data. Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead. When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations. Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.