Related papers: TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge

URL: http://arxiv.org/abs/2506.01458v1
Date: Mon, 02 Jun 2025 09:16:09 GMT
Title: TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge
Authors: Tanel Alumäe, Artem Fedorchenko
Abstract summary: A hybrid language identification system is used, consisting of a pretrained language embedding model and a light-weight speech recognition model with a shared encoder across languages. For speech recognition, three models are used, where only a single model is applied for each language, depending on the training data availability and performance on held-out data. The system obtained the top overall score in the challenge.
Score: 4.297070083645049
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper describes the language identification and multilingual speech recognition system developed at Tallinn University of Technology for the Interspeech 2025 ML-SUPERB 2.0 Challenge. A hybrid language identification system is used, consisting of a pretrained language embedding model and a light-weight speech recognition model with a shared encoder across languages and language-specific bigram language models. For speech recognition, three models are used, where only a single model is applied for each language, depending on the training data availability and performance on held-out data. The model set consists of a finetuned version of SeamlessM4T, MMS-1B-all with custom language adapters and MMS-zeroshot. The system obtained the top overall score in the challenge.
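The per-language model selection described in the abstract can be illustrated with a minimal sketch. The abstract states only that one of three models is chosen per language based on training data availability and held-out performance; the selection rule below (lowest held-out error among available models, with a zero-shot fallback), the model keys, the language codes, and all error values are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Optional

# Hypothetical keys standing in for the three models named in the abstract.
FALLBACK_MODEL = "mms_zeroshot"


def select_model(heldout_error: Dict[str, Optional[float]],
                 fallback: str = FALLBACK_MODEL) -> str:
    """Pick the model with the lowest held-out error rate for one language.

    A value of None means the model could not be trained or evaluated for
    that language (e.g. no training data). If no model has a score, the
    zero-shot fallback is returned. This rule is an assumption, not the
    authors' documented procedure.
    """
    scored = {name: err for name, err in heldout_error.items() if err is not None}
    if not scored:
        return fallback
    return min(scored, key=scored.get)


# Illustrative, made-up held-out error rates (lower is better).
heldout = {
    "est": {"seamless_m4t_finetuned": 0.08,
            "mms_1b_all_adapters": 0.11,
            "mms_zeroshot": 0.35},
    "xyz": {"seamless_m4t_finetuned": None,
            "mms_1b_all_adapters": None,
            "mms_zeroshot": None},
}

# Routing table: exactly one model per language.
routing = {lang: select_model(scores) for lang, scores in heldout.items()}
```

At inference time, such a routing table would be consulted after language identification, so that each utterance is decoded by a single model.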

Related papers

Transsion Multilingual Speech Recognition System for MLC-SLM 2025 Challenge [18.816408172588144]
This paper presents the architecture and performance of a novel Multilingual Automatic Speech Recognition (ASR) system developed by the Transsion Speech Team for Track 1 of the MLC-SLM 2025 Challenge. The proposed system comprises three key components: 1) a frozen Whisper-large-v3 based speech encoder, leveraging large-scale pretraining to ensure robust acoustic feature extraction. By systematically combining pretrained models with task-specific fine-tuning, the system achieved a word/character error rate (WER/CER) of 9.83% across 11 languages in the evaluation set and ranked third among global participants.
arXiv Detail & Related papers (2025-08-15T10:39:05Z)
Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond [87.4049283495551]
The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework. The challenge garnered 12 model submissions and 54 language corpora, resulting in a comprehensive benchmark encompassing 154 languages. The findings indicate that merely scaling models is not the definitive solution for multilingual speech tasks.
arXiv Detail & Related papers (2023-10-09T08:30:01Z)
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages. We developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models. It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. Main ingredients are a new dataset based on readings of publicly available religious texts. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
Language-Universal Adapter Learning with Knowledge Distillation for End-to-End Multilingual Speech Recognition [28.416831396722106]
We propose a language-universal adapter learning framework based on a pre-trained model for end-to-end multilingual automatic speech recognition. An online knowledge distillation is then used to enable the language-universal adapters to learn both language-specific and universal features. Compared to the conventional multilingual model, a 3.3% absolute error rate reduction is achieved.
arXiv Detail & Related papers (2023-02-28T14:43:49Z)
Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded inside a well-trained teacher text model to the student speech model. We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
arXiv Detail & Related papers (2022-06-25T12:36:11Z)
Pretraining Approaches for Spoken Language Recognition: TalTech Submission to the OLR 2021 Challenge [0.0]
The paper is based on our submission to the Oriental Language Recognition 2021 Challenge. For the constrained track, we first trained a Conformer-based encoder-decoder model for multilingual automatic speech recognition. For the unconstrained task, we relied on both externally available pretrained models as well as external data.
arXiv Detail & Related papers (2022-05-14T15:17:08Z)
Code Switched and Code Mixed Speech Recognition for Indic languages [0.0]
Training multilingual automatic speech recognition (ASR) systems is challenging because acoustic and lexical information is typically language specific. We compare the performance of an end-to-end multilingual speech recognition system to the performance of monolingual models conditioned on language identification (LID). We also propose a similar technique to solve the code-switching problem and achieve a WER of 21.77 and 28.27 on Hindi-English and Bengali-English, respectively.
arXiv Detail & Related papers (2022-03-30T18:09:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.