The Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR
- URL: http://arxiv.org/abs/2510.22492v1
- Date: Sun, 26 Oct 2025 02:13:26 GMT
- Title: The Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR
- Authors: Siyu Liang, Nicolas Ballier, Gina-Anne Levow, Richard Wright
- Abstract summary: We analyze Whisper's decoding behavior during inference across 49 languages. We study the utilization pattern of the model's sub-token space.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How much audio is needed to fully observe a multilingual ASR model's learned sub-token inventory across languages, and does data disparity in multilingual pre-training affect how these tokens are utilized during inference? We address this question by analyzing Whisper's decoding behavior during inference across 49 languages. By logging decoding candidate sub-tokens and tracking their cumulative discovery over time, we study the utilization pattern of the model's sub-token space. Results show that the total number of discovered tokens remains largely independent of a language's pre-training hours, indicating that data disparity does not strongly influence lexical diversity in the model's hypothesis space. Sub-token discovery rates follow a consistent exponential saturation pattern across languages, suggesting a stable time window after which additional audio yields minimal new sub-token activation. We refer to this convergence threshold as acoustic saturation time (AST). Further analyses of rank-frequency distributions reveal Zipf-like patterns better modeled by a Zipf-Mandelbrot law, and mean sub-token length shows a positive correlation with resource level. Additionally, those metrics show more favorable patterns for languages in the Latin script than those in scripts such as Cyrillic, CJK, and Semitic. Together, our study suggests that sub-token utilization during multilingual ASR inference is constrained more by the statistical, typological, and orthographic structure of the speech than by training data scale, providing an empirical basis for more equitable corpus construction and cross-lingual evaluation.
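The exponential saturation pattern described in the abstract can be illustrated with a small curve fit. The sketch below uses synthetic data, not the paper's code; the functional form N(t) = N_max * (1 - exp(-t/tau)) and the 99%-of-ceiling definition of an AST-like threshold are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical exponential saturation model: cumulative distinct
# sub-tokens N(t) approach a ceiling n_max with time constant tau.
def saturation(t, n_max, tau):
    return n_max * (1.0 - np.exp(-t / tau))

# Synthetic "discovery" data: minutes of audio vs. distinct sub-tokens seen.
rng = np.random.default_rng(0)
t = np.linspace(1, 600, 60)                    # minutes of audio
n_seen = saturation(t, 5000.0, 90.0) + rng.normal(0, 40, t.size)

(n_max_hat, tau_hat), _ = curve_fit(saturation, t, n_seen, p0=(4000, 60))

# Illustrative AST-like threshold: time at which 99% of the fitted
# ceiling has been observed, i.e. t_ast = tau * ln(100).
t_ast = tau_hat * np.log(100)
print(f"fitted ceiling ~ {n_max_hat:.0f} sub-tokens, AST ~ {t_ast:.0f} min")
```

Fitting such a curve per language and reading off the threshold is one simple way to operationalize "additional audio yields minimal new sub-token activation."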
Related papers
- PRiSM: Benchmarking Phone Realization in Speech Models [70.82595415252682]
Phone recognition (PR) serves as the atomic interface for language-agnostic modeling in cross-lingual speech processing and phonetic analysis. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception.
arXiv Detail & Related papers (2026-01-20T15:00:36Z) - Scaling Spoken Language Models with Syllabic Speech Tokenization [17.835120807367677]
Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from SSL speech models. Recent SSL work introduces acoustic tokenization of speech at the syllable level. Syllabic tokens can match or surpass previous high-frame-rate tokens while significantly cutting training and inference costs.
arXiv Detail & Related papers (2025-09-30T17:59:09Z) - Beyond WER: Probing Whisper's Sub-token Decoder Across Diverse Language Resource Levels [6.627057618324123]
This paper introduces a fine-grained analysis of Whisper's multilingual decoder. Our method traces the beam search path, capturing sub-token guesses and their associated probabilities. Lower-resource languages fare worse on these metrics, but also exhibit distinct clustering patterns in sub-token usage.
arXiv Detail & Related papers (2025-09-29T21:20:05Z) - Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks [7.216732751280017]
We examine Tokenization Parity (TP) and Information Parity (IP) as measures of representational bias in pre-trained multilingual models. We compare state-of-the-art decoder-only LLMs with encoder-based models across three tasks: dialect classification, topic classification, and extractive question answering. Our analysis reveals that TP is a better predictor of performance on tasks reliant on syntactic and morphological cues, while IP better predicts performance on semantic tasks.
arXiv Detail & Related papers (2025-09-24T12:13:53Z) - Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs [59.230858581944425]
Two dominant approaches have emerged for speech processing: discrete tokens and continuous features. We compare self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. Our findings reveal that continuous features generally outperform discrete tokens in various tasks.
arXiv Detail & Related papers (2025-08-25T10:16:07Z) - SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA performance in segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z) - Cross-Lingual Transfer Learning for Speech Translation [7.802021866251242]
This paper examines how to expand the speech translation capability of speech foundation models with restricted data. Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model. Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space.
arXiv Detail & Related papers (2024-07-01T09:51:48Z) - Establishing degrees of closeness between audio recordings along different dimensions using large-scale cross-lingual models [4.349838917565205]
We propose a new unsupervised method using ABX tests on audio recordings with carefully curated metadata.
Three experiments are devised: one on room acoustics aspects, one on linguistic genre, and one on phonetic aspects.
The results confirm that the representations extracted from recordings with different linguistic/extra-linguistic characteristics differ along the same lines.
arXiv Detail & Related papers (2024-02-08T11:31:23Z) - VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z) - Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multilingual models with more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
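One of the related papers above measures Tokenization Parity (TP) across languages. As a minimal illustration of the idea, TP can be sketched as the ratio of token counts a tokenizer assigns to parallel sentences; the toy whitespace tokenizer and the helper names below are illustrative assumptions, not the cited paper's implementation.

```python
# Toy sketch of Tokenization Parity (TP): the ratio of token counts a
# tokenizer produces for parallel text in two languages. A real study
# would use a subword tokenizer; str.split stands in for one here.
def token_count(text, tokenize=str.split):
    return len(tokenize(text))

def tokenization_parity(pairs, tokenize=str.split):
    """pairs: list of (lang_a_sentence, lang_b_sentence) parallel texts."""
    a = sum(token_count(s, tokenize) for s, _ in pairs)
    b = sum(token_count(s, tokenize) for _, s in pairs)
    return a / b  # TP = 1.0 means equal tokenization cost

parallel = [
    ("the cat sat on the mat", "die Katze sass auf der Matte"),
    ("speech models are useful", "Sprachmodelle sind nuetzlich"),
]
print(f"TP = {tokenization_parity(parallel):.2f}")
```

A TP far from 1.0 indicates that one language pays a systematically higher tokenization cost for the same content, which is the kind of representational bias the paper correlates with downstream performance.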
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.