The Eloquence team submission for task 1 of MLC-SLM challenge
- URL: http://arxiv.org/abs/2507.19308v1
- Date: Fri, 25 Jul 2025 14:23:49 GMT
- Title: The Eloquence team submission for task 1 of MLC-SLM challenge
- Authors: Lorenzo Concina, Jordi Luque, Alessio Brutti, Marco Matassoni, Yuchen Zhang
- Abstract summary: We present our studies and experiments carried out for Task 1 of the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM). This task focuses on advancing multilingual conversational speech recognition through the development of speech language model architectures.
- Score: 12.442720957990533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present our studies and experiments carried out for Task 1 of the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM), which focuses on advancing multilingual conversational speech recognition through the development of speech language model architectures. Given the increasing relevance of real-world conversational data for building robust Spoken Dialogue Systems, we explore three approaches to multilingual ASR. First, we evaluate the official baseline to better understand its strengths and limitations by training two projectors (linear and Q-Former) with different foundation models. Second, we leverage the SLAM-ASR framework to train a custom multilingual linear projector. Finally, we investigate the role of contrastive learning and extended conversational context in enhancing the robustness of recognition.
Related papers
- Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge [24.966911190845817]
This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture.
arXiv Detail & Related papers (2025-07-23T07:48:33Z)
- A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations [25.58593495281218]
We propose a unified speech LLM that jointly performs diarization and ASR in an end-to-end manner. By reformulating the training data format and modifying the inference procedure, our model addresses the ambiguity inherent in pre-segmented audio.
arXiv Detail & Related papers (2025-06-26T01:54:02Z)
- Building a Taiwanese Mandarin Spoken Language Model: A First Attempt [44.54200115439157]
This report aims to build a spoken large language model (LLM) for Taiwanese Mandarin tailored to enable real-time speech interaction in multi-turn conversations. Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving conversational flow.
arXiv Detail & Related papers (2024-11-11T16:37:40Z)
- A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [51.8203871494146]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing. Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient. This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z)
- Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training [29.47243668154796]
BLOOMZMMS is a novel model that integrates a multilingual LLM with a multilingual speech encoder.
We demonstrate the transferability of linguistic knowledge from the text to the speech modality.
Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks.
arXiv Detail & Related papers (2024-04-16T21:45:59Z)
- Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model [8.180382743037082]
This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously.
arXiv Detail & Related papers (2023-09-20T01:48:27Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [70.08842857515141]
SpokenWOZ is a large-scale speech-text dataset for spoken task-oriented dialogue (TOD). Cross-turn slot detection and reasoning slot detection are new challenges introduced by SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs [59.74002011562726]
We propose a novel linguistic cue-based chain-of-thoughts (Cue-CoT) to provide a more personalized and engaging response.
We build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English.
Empirical results demonstrate our proposed Cue-CoT method outperforms standard prompting methods in terms of both helpfulness and acceptability on all datasets.
arXiv Detail & Related papers (2023-05-19T16:27:43Z)
- A Multi-task Multi-stage Transitional Training Framework for Neural Chat Translation [84.59697583372888]
Neural chat translation (NCT) aims to translate a cross-lingual chat between speakers of different languages.
Existing context-aware NMT models cannot achieve satisfactory performances due to limited resources of annotated bilingual dialogues.
We propose a multi-task multi-stage transitional (MMT) training framework, where an NCT model is trained using the bilingual chat translation dataset and additional monolingual dialogues.
arXiv Detail & Related papers (2023-01-27T14:41:16Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue Modeling [52.99188200886738]
BiToD is the first bilingual multi-domain dataset for end-to-end task-oriented dialogue modeling.
BiToD contains over 7k multi-domain dialogues (144k utterances) with a large and realistic bilingual knowledge base.
arXiv Detail & Related papers (2021-06-05T03:38:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.