Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration
- URL: http://arxiv.org/abs/2601.02906v1
- Date: Tue, 06 Jan 2026 10:45:04 GMT
- Title: Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration
- Authors: Ryan Soh-Eun Shim, Kwanghee Choi, Kalvin Chang, Ming-Hao Hsu, Florian Eichin, Zhizheng Wu, Alane Suhr, Michael A. Hedderich, David Harwath, David R. Mortensen, Barbara Plank
- Abstract summary: We show that script is linearly encoded in the activation space of multilingual speech models, and that modifying activations at inference time enables direct control over output script. We apply this approach to inducing post-hoc control over the script of speech recognition output, where we observe competitive performance across all model sizes of Whisper.
- Score: 70.84108518476744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual speech foundation models such as Whisper are trained on web-scale data, where the data for each language spans a myriad of regional varieties. However, different regional varieties often employ different scripts to write the same language, which makes the script of speech recognition output non-deterministic as well. To mitigate this problem, we show that script is linearly encoded in the activation space of multilingual speech models, and that modifying activations at inference time enables direct control over the output script. We find that adding such script vectors to activations at test time can induce a script change even in unconventional language-script pairings (e.g. Italian in Cyrillic and Japanese in Latin script). We apply this approach to inducing post-hoc control over the script of speech recognition output, where we observe competitive performance across all model sizes of Whisper.
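As described, the method is a form of activation steering: estimate a direction in activation space that separates scripts, then add it during decoding. The sketch below illustrates the idea with a PyTorch forward hook on a Hugging Face Whisper model; the choice of layer, the steering scale, and estimating the script vector as a mean difference of hidden states are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: steering Whisper's decoder activations with a "script vector".
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model.eval()

def add_script_vector(vector: torch.Tensor, scale: float = 1.0):
    """Forward hook that shifts a decoder layer's output along `vector`."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# script_vector: precomputed offline, e.g. mean(h_cyrillic) - mean(h_latin)
# over paired transcripts; a zero placeholder keeps this sketch runnable.
script_vector = torch.zeros(model.config.d_model)
layer = model.model.decoder.layers[4]  # which layer to steer is a guess
handle = layer.register_forward_hook(add_script_vector(script_vector, scale=4.0))
# ... call model.generate(input_features) as usual; undo with handle.remove()
```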
Related papers
- A Case Against Implicit Standards: Homophone Normalization in Machine Translation for Languages that use the Ge'ez Script [3.5149312379702127]
Homophone normalization is a pre-processing step applied in Amharic Natural Language Processing literature. We propose a post-inference intervention in which normalization is applied to model predictions instead of training data. Our work contributes to the broader discussion on technology-facilitated language change and calls for more language-aware interventions.
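A minimal sketch of such a post-inference intervention, assuming normalization reduces to a character-level mapping; the two Ge'ez-script merges below are illustrative examples, not the paper's actual table.

```python
# Sketch: homophone normalization applied to predictions, not training data.
HOMOPHONE_MAP = str.maketrans({
    "ሐ": "ሀ",  # example: merge homophonous 'ha' variants
    "ሠ": "ሰ",  # example: merge homophonous 'sa' variants
})

def normalize_prediction(text: str) -> str:
    """Apply homophone normalization to a model's output string."""
    return text.translate(HOMOPHONE_MAP)

# Applied after decoding, so the training data keeps its original spellings:
# hypothesis = normalize_prediction(model_output)
```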
arXiv Detail & Related papers (2025-07-20T22:35:08Z)
- A two-stage transliteration approach to improve performance of a multilingual ASR [1.9511556030544333]
This paper presents an approach to building a language-agnostic end-to-end model trained on a grapheme set.
We performed experiments with an end-to-end multilingual speech recognition system for two Indic languages.
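One plausible reading of a shared grapheme set for Indic languages is a rule-based mapping that exploits the parallel layout of Unicode Indic blocks; the sketch below is a simplified illustration under that assumption, not the paper's transliteration scheme.

```python
# Sketch: re-basing Indic scripts onto a shared grapheme set via the parallel
# layout of Unicode Indic blocks (each spans 128 codepoints). Real systems
# handle the many exceptions this offset trick ignores.
BLOCK_START = {"devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00}

def to_common_graphemes(text: str, script: str) -> str:
    """Map characters of `script` onto the Devanagari block."""
    offset = BLOCK_START[script] - BLOCK_START["devanagari"]
    out = []
    for ch in text:
        cp = ord(ch)
        if BLOCK_START[script] <= cp < BLOCK_START[script] + 0x80:
            out.append(chr(cp - offset))
        else:
            out.append(ch)  # punctuation, digits, etc. pass through
    return "".join(out)

# e.g. to_common_graphemes("বাংলা", "bengali") yields a Devanagari-block string
```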
arXiv Detail & Related papers (2024-10-09T05:30:33Z)
- LangSAMP: Language-Script Aware Multilingual Pretraining [48.16511046793275]
We propose Language-Script Aware Multilingual Pretraining (LangSAMP). LangSAMP incorporates both language and script embeddings to enhance representation learning. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages.
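A minimal sketch of what incorporating language and script embeddings could look like, assuming they are simply summed with token embeddings; the embedding sizes and the injection point are guesses, not LangSAMP's actual design.

```python
# Sketch: token + language + script embeddings, with illustrative sizes.
import torch
import torch.nn as nn

class LangScriptEmbedding(nn.Module):
    def __init__(self, vocab=250_002, n_langs=500, n_scripts=30, dim=768):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.lang = nn.Embedding(n_langs, dim)
        self.script = nn.Embedding(n_scripts, dim)

    def forward(self, token_ids, lang_id, script_id):
        # Broadcast the sentence-level language/script vectors over tokens.
        return (self.tok(token_ids)
                + self.lang(lang_id)[:, None, :]
                + self.script(script_id)[:, None, :])

emb = LangScriptEmbedding()
x = emb(torch.randint(0, 1000, (2, 16)),  # (batch, seq) token ids
        torch.tensor([3, 7]),             # one language id per sentence
        torch.tensor([0, 1]))             # one script id per sentence
```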
arXiv Detail & Related papers (2024-09-26T18:29:10Z)
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
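A small sketch of the three template variants, using `unidecode` as a stand-in romanizer; the paper's actual transliterator and prompt wording may differ.

```python
# Sketch: one prompt per condition - original script, Latin script, or both.
from unidecode import unidecode

def build_prompts(text: str, task: str) -> dict:
    latin = unidecode(text)  # crude romanization, stand-in only
    return {
        "original": f"{task}\nText: {text}\nAnswer:",
        "latin":    f"{task}\nText: {latin}\nAnswer:",
        "both":     f"{task}\nText: {text}\nRomanized: {latin}\nAnswer:",
    }

prompts = build_prompts("Привет, мир", "Classify the sentiment.")
```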
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
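A minimal sketch of one possible alignment objective, pulling the embedding of a sentence toward that of its transliteration; the MSE formulation over normalized, mean-pooled embeddings is an assumption, not necessarily the PPA loss.

```python
# Sketch: one alignment step between a sentence and its transliteration.
import torch
import torch.nn.functional as F

def alignment_loss(h_orig: torch.Tensor, h_translit: torch.Tensor) -> torch.Tensor:
    """h_*: (batch, dim) mean-pooled sentence embeddings from the same encoder."""
    return F.mse_loss(F.normalize(h_orig, dim=-1),
                      F.normalize(h_translit, dim=-1))
```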
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script.
We show that Furina, the resulting model, outperforms the original Glot500-m on various zero-shot cross-lingual transfer tasks.
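A plausible instantiation of the contrastive objective is InfoNCE over (sentence, transliteration) pairs, sketched below; the temperature and in-batch-negatives setup are assumptions rather than TransliCo's confirmed configuration.

```python
# Sketch: InfoNCE contrasting sentences with their transliterations.
import torch
import torch.nn.functional as F

def transliteration_contrastive_loss(h_orig, h_translit, temperature=0.07):
    """h_orig, h_translit: (batch, dim); row i of each is the same sentence."""
    z1 = F.normalize(h_orig, dim=-1)
    z2 = F.normalize(h_translit, dim=-1)
    logits = z1 @ z2.T / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0))    # positives on the diagonal
    return F.cross_entropy(logits, targets)
```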
arXiv Detail & Related papers (2024-01-12T15:12:48Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
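A sketch of the usual pseudo-text pipeline behind speech units, quantizing self-supervised features with k-means and collapsing repeats; the feature extractor and cluster count are assumptions, not details confirmed by this summary.

```python
# Sketch: discretizing speech features into "units" that read like tokens.
import numpy as np
from sklearn.cluster import KMeans

def features_to_units(features: np.ndarray, kmeans: KMeans) -> list[int]:
    """features: (frames, dim) self-supervised speech features."""
    units = kmeans.predict(features)
    # Collapse consecutive repeats so units read like tokens, not frames.
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

# Toy fit; real pipelines cluster e.g. HuBERT features into ~1000 units.
kmeans = KMeans(n_clusters=100, n_init=10).fit(np.random.randn(2000, 64))
units = features_to_units(np.random.randn(200, 64), kmeans)
```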
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities [36.578851892373365]
Social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages.
This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script.
Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated.
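A minimal sketch of synthetic noise injection for this task, assuming noise takes the form of character-level confusions between Perso-Arabic letter variants; the pairs below are illustrative, not the paper's noise model.

```python
# Sketch: generating noisy training pairs for script normalization.
import random

CONFUSIONS = {"ك": "ک", "ي": "ی", "ة": "ه"}  # Arabic vs. Persian letter forms

def add_script_noise(text: str, noise_level: float = 0.3) -> str:
    """Randomly swap characters for unconventional variant forms."""
    out = []
    for ch in text:
        if ch in CONFUSIONS and random.random() < noise_level:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)

# Training pairs: (add_script_noise(clean), clean) at several noise levels.
```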
arXiv Detail & Related papers (2023-05-25T18:18:42Z)
- Towards Zero-Shot Code-Switched Speech Recognition [44.76492452463019]
We seek to build effective code-switched (CS) automatic speech recognition (ASR) systems under the zero-shot setting.
We propose to simplify each monolingual module by allowing it to transcribe all speech segments indiscriminately with a monolingual script.
We apply this transliteration-based approach in an end-to-end differentiable neural network and demonstrate its efficacy for zero-shot CS ASR on Mandarin-English SEAME test sets.
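As a rough illustration only: each monolingual recognizer transcribes every segment in its own script, and the outputs are then reconciled. The confidence-based selection below is a crude stand-in for the paper's end-to-end differentiable formulation.

```python
# Sketch: reconciling two monolingual recognizers on code-switched speech.
def merge_codeswitched(segments, asr_zh, asr_en):
    """segments: list of audio chunks; asr_*: callables -> (text, confidence)."""
    merged = []
    for seg in segments:
        zh_text, zh_conf = asr_zh(seg)  # Mandarin module, Mandarin script
        en_text, en_conf = asr_en(seg)  # English module, Latin script
        merged.append(zh_text if zh_conf >= en_conf else en_text)
    return " ".join(merged)
```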
arXiv Detail & Related papers (2022-11-02T19:52:54Z)
- Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
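A toy sketch of the underlying idea: condition synthesis on phonological feature vectors rather than phone identities, so an unseen phone can be approximated from its features. The tiny feature table is hypothetical; real systems use full feature inventories.

```python
# Sketch: phones as phonological feature vectors, so unseen phones can be
# approximated at test time from their features alone.
FEATURES = ("voiced", "nasal", "labial", "coronal", "continuant")
PHONE_TABLE = {
    "b": (1, 0, 1, 0, 0),
    "m": (1, 1, 1, 0, 0),
    "d": (1, 0, 0, 1, 0),
    "s": (0, 0, 0, 1, 1),
}

def phone_to_features(phone: str) -> tuple:
    return PHONE_TABLE[phone]

# An unseen phone, e.g. "n" = voiced nasal coronal, maps to (1, 1, 0, 1, 0)
# and can be synthesized from its features even if "n" never appeared in training.
```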
arXiv Detail & Related papers (2020-08-06T18:25:18Z)