Related papers: Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models

URL: http://arxiv.org/abs/2506.13300v3
Date: Wed, 18 Jun 2025 06:57:58 GMT
Title: Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models
Authors: Bo Li, Chengben Xu, Wufeng Zhang,
Abstract summary: This paper presents Seewo's systems for both tracks of the Multilingual Conversational Speech Language Model Challenge (MLC-SLM). We introduce a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR.
Score: 4.917936997225074
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents Seewo's systems for both tracks of the Multilingual Conversational Speech Language Model Challenge (MLC-SLM), addressing automatic speech recognition (ASR) and speaker diarization with ASR (SD-ASR). We introduce a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR. Our approach combines curriculum learning for progressive capability acquisition, Chain-of-Thought data augmentation to foster intermediate reflection, and Reinforcement Learning with Verifiable Rewards (RLVR) to further refine self-correction through reward-driven optimization. This approach achieves substantial improvements over the official challenge baselines. On the evaluation set, our best system attains a WER/CER of 11.57% for Track 1 and a tcpWER/tcpCER of 17.67% for Track 2. Comprehensive ablation studies demonstrate the effectiveness of each component under challenge constraints.
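
The WER/CER figures quoted in the abstract are edit-distance error rates. As a rough illustration of how such a metric is conventionally computed, here is a minimal Python sketch. It assumes the standard definition (Levenshtein distance divided by reference length) and whitespace tokenization for words. The function names are hypothetical. It does not reproduce the challenge's official scoring, which may apply text normalization, and it does not cover the time-constrained tcpWER/tcpCER used for Track 2.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate, assuming whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving 2/6. A combined WER/CER figure typically means CER for languages written without word boundaries and WER for the rest.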

Related papers

Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning [8.717610965852037]
Spoken Language Assessment (SLA) estimates a learner's oral proficiency from spontaneous speech. This paper introduces a novel multimodal foundation model approach that performs session-level evaluation in a single pass.
arXiv Detail & Related papers (2025-09-19T14:33:05Z)
ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition [14.909498057247648]
This work proposes an innovative training paradigm, Iterative LoRA Training (ILT), in combination with an Iterative Pseudo Labeling strategy. Based on Whisper-large-v3 and Qwen2-Audio, we conduct systematic experiments using a three-stage training process: Focus Training, Feed Back Training, and Fix Training. The MegaAIS research team applied this technique in the Interspeech 2025 Multilingual Conversational Speech Language Modeling Challenge (MLC-SLM), achieving 4th place in Track 1 and 1st place in Track 2.
arXiv Detail & Related papers (2025-07-11T10:38:51Z)
Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models [79.90523648823522]
Multi-stage continual learning can lead to catastrophic forgetting. This paper evaluates three mitigation strategies: model merging, discounting the LoRA scaling factor, and experience replay. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods.
arXiv Detail & Related papers (2025-05-23T05:50:14Z)
MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z)
Multi-stage Large Language Model Correction for Speech Recognition [10.995600950995021]
We propose a novel multi-stage approach that utilizes uncertainty estimation of ASR outputs and the reasoning capability of large language models (LLMs). Our experimental results demonstrate the effectiveness of the proposed method by showing a 10% to 20% relative improvement in WER over competitive ASR systems.
arXiv Detail & Related papers (2023-10-17T19:02:40Z)
Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator [42.8787280791491]
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization. We propose a cost-effective method to convert a single-talker automatic speech recognition system into a multi-talker one. We incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
arXiv Detail & Related papers (2023-05-25T17:18:37Z)
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing unsupervised speaker information extraction. Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance. We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech [17.602098162338137]
We explore a multimodal semi-supervised learning approach for punctuation prediction. We learn representations from large amounts of unlabelled audio and text data. When trained on 1 hour of speech and text data, the proposed model achieved a 9-18% absolute improvement over the baseline model.
arXiv Detail & Related papers (2020-08-03T08:13:09Z)
Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU). We show that the error rates of off-the-shelf ASR and the following LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.