Transsion Multilingual Speech Recognition System for MLC-SLM 2025 Challenge
- URL: http://arxiv.org/abs/2508.14916v1
- Date: Fri, 15 Aug 2025 10:39:05 GMT
- Title: Transsion Multilingual Speech Recognition System for MLC-SLM 2025 Challenge
- Authors: Xiaoxiao Li, An Zhu, Youhai Jiang, Fengjie Zhu,
- Abstract summary: This paper presents the architecture and performance of a novel Multilingual Automatic Speech Recognition (ASR) system developed by the Transsion Speech Team for Track 1 of the MLC-SLM 2025 Challenge.<n>The proposed system comprises three key components: 1) a frozen Whisper-large-v3 based speech encoder, leveraging large-scale pretraining to ensure robust acoustic feature extraction.<n>By systematically combining pretrained models with task specific fine-tuning, the system achieved a word/character error rate (WER/CER) of 9.83% across 11 languages in the evaluation set and ranked third place among global participants.
- Score: 18.816408172588144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the architecture and performance of a novel Multilingual Automatic Speech Recognition (ASR) system developed by the Transsion Speech Team for Track 1 of the MLC-SLM 2025 Challenge. The proposed system comprises three key components: 1) a frozen Whisper-large-v3 based speech encoder, leveraging large-scale pretraining to ensure robust acoustic feature extraction; 2) a trainable adaptor module using Linear-ReLU-Linear transformation mechanisms to effectively align speech and text representations; and 3) a frozen Qwen2.5-7B-Instruct large language model (LLM) integrated with trainable LoRA for optimized contextual linguistic decoding. By systematically combining pretrained models with task specific fine-tuning, the system achieved a word/character error rate (WER/CER) of 9.83% across 11 languages in the evaluation set and ranked third place among global participants.
Related papers
- AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning [49.68129589035101]
We introduce AZeroS (Auden Zero-instruction-tuned Speech-LLM), which is trained on speech-text pairs derived from publicly available corpora.<n>AZeroS achieves state-of-the-art performance on both semantic and paralinguistic benchmarks.
arXiv Detail & Related papers (2025-12-31T04:05:04Z) - Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge [24.966911190845817]
This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge.<n>Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture.
arXiv Detail & Related papers (2025-07-23T07:48:33Z) - SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge [3.9836024799656053]
Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework.<n>The SHNU-mASR system achieves an overall character/word error rate (CER/WER) of 11.76% on the blind evaluation set of the INTERSPEECH 2025 MLC-SLM Challenge.
arXiv Detail & Related papers (2025-07-04T07:10:33Z) - Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems [2.9034429823924865]
This paper focuses on multilingual speech recognition and language modeling with large language models (LLMs) for the MLC-SLM Challenge 2025.<n>Our system achieves competitive performance with a private test average WER/CER result of 16.63% using the Gemma3-12B and 18.6% using the Qwen2.5-7B as decoder-only language model.
arXiv Detail & Related papers (2025-06-16T15:23:07Z) - TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge [4.297070083645049]
A hybrid language identification system is used, consisting of a pretrained language embedding model and a light-weight speech recognition model with a shared encoder across languages.<n>For speech recognition, three models are used, where only a single model is applied for each language, depending on the training data availability and performance on held-out data.<n>The system obtained the top overall score in the challenge.
arXiv Detail & Related papers (2025-06-02T09:16:09Z) - Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [83.0622534215881]
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities.<n>Our pipeline consists of three main components: First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures.<n>Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL.<n>Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios.
arXiv Detail & Related papers (2025-02-26T17:26:36Z) - Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition [110.8431434620642]
We introduce the generative speech transcription error correction (GenSEC) challenge.
This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition.
We discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
arXiv Detail & Related papers (2024-09-15T16:32:49Z) - WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter.
We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set.
arXiv Detail & Related papers (2024-03-31T12:01:32Z) - A Multimodal Approach to Device-Directed Speech Detection with Large Language Models [41.37311266840156]
We explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase.
We train classifiers using only acoustic information obtained from the audio waveform.
We take the decoder outputs of an automatic speech recognition system, such as 1-best hypotheses, as input features to a large language model.
arXiv Detail & Related papers (2024-03-21T14:44:03Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.