Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems
- URL: http://arxiv.org/abs/2506.13596v2
- Date: Mon, 07 Jul 2025 09:09:16 GMT
- Title: Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems
- Authors: Tuan Nguyen, Long-Vu Hoang, Huy-Dat Tran
- Abstract summary: This paper focuses on multilingual speech recognition and language modeling with large language models (LLMs) for the MLC-SLM Challenge 2025. Our system achieves competitive performance, with a private-test average WER/CER of 16.63% using Gemma3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
- Score: 2.9034429823924865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance, with a private-test average WER/CER of 16.63% using Gemma3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
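The paper itself ships no code, but the pipeline it describes (fine-tuned Whisper encoder, projector, decoder-only LLM) can be sketched as below. This is a minimal illustration: the module names, dimensions, and the tiny placeholder encoder/LLM bodies are assumptions standing in for Whisper-large-v3 and Gemma3-12B/Qwen2.5-7B, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeechLLM(nn.Module):
    """Minimal encoder-projector-LLM sketch (not the paper's code)."""

    def __init__(self, enc_dim=1280, llm_dim=3584):
        super().__init__()
        # Placeholder for the speech encoder (real system: Whisper-large-v3).
        self.speech_encoder = nn.Linear(80, enc_dim)
        # Projector mapping encoder features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Placeholder for the decoder-only LLM body (real system: Gemma3/Qwen2.5).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, mel):                  # mel: (batch, frames, 80)
        feats = self.speech_encoder(mel)     # (batch, frames, enc_dim)
        soft_tokens = self.projector(feats)  # (batch, frames, llm_dim)
        return self.llm(soft_tokens)         # consumed like text embeddings

model = SpeechLLM()
out = model(torch.randn(2, 100, 80))
print(out.shape)  # torch.Size([2, 100, 3584])
```

A three-stage schedule like the one described would then unfreeze the encoder, projector, and LLM in turn by toggling `requires_grad` on each component.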
Related papers
- WavLink: Compact Audio-Text Embeddings with a Global Whisper Token [4.000493292896401]
We present WavLink, a compact audio-text embedding model that augments the Whisper encoder with a learnable global token. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop.
arXiv Detail & Related papers (2026-01-21T15:55:58Z) - Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR [16.090902570653803]
We present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. Our system achieves a CER/WER of 10.69%, ranking on par with the top-ranked Track 1 systems.
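A hedged sketch of the dual-encoder idea this entry describes: features from two speech encoders are concatenated per frame and projected into a shared space. The stand-in encoders, dimensions, and concat-plus-linear fusion are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Fuse two speech-encoder feature streams (e.g. Whisper + mHuBERT)."""

    def __init__(self, whisper_dim=1280, hubert_dim=768, out_dim=2048):
        super().__init__()
        self.fuse = nn.Linear(whisper_dim + hubert_dim, out_dim)

    def forward(self, whisper_feats, hubert_feats):
        # Assumes both streams were resampled to a common frame rate;
        # concatenate per frame and project to the shared space.
        fused = torch.cat([whisper_feats, hubert_feats], dim=-1)
        return self.fuse(fused)

fusion = DualEncoderFusion()
out = fusion(torch.randn(2, 50, 1280), torch.randn(2, 50, 768))
print(out.shape)  # torch.Size([2, 50, 2048])
```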
arXiv Detail & Related papers (2026-01-04T10:08:53Z) - JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation [108.21827580870979]
This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT adopts an encoder-LLM-decoder architecture with a SyncFusion module for spatio-temporal audio-video fusion. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning.
arXiv Detail & Related papers (2025-12-28T12:25:43Z) - Transsion Multilingual Speech Recognition System for MLC-SLM 2025 Challenge [18.816408172588144]
This paper presents the architecture and performance of a novel Multilingual Automatic Speech Recognition (ASR) system developed by the Transsion Speech Team for Track 1 of the MLC-SLM 2025 Challenge. The proposed system comprises three key components, the first of which is a frozen Whisper-large-v3-based speech encoder that leverages large-scale pretraining to ensure robust acoustic feature extraction. By systematically combining pretrained models with task-specific fine-tuning, the system achieved a word/character error rate (WER/CER) of 9.83% across 11 languages on the evaluation set and ranked third among global participants.
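The 9.83% figure is an average word/character error rate. As a reference point, here is a minimal sketch of how WER/CER are typically computed, using the `jiwer` package; the challenge's official scoring script may differ (CER is usually applied to languages without whitespace word boundaries, WER elsewhere).

```python
import jiwer

reference = "hello world how are you"
hypothesis = "hello word how are you"

# 1 substituted word out of 5, 1 dropped character out of 23.
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")
```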
arXiv Detail & Related papers (2025-08-15T10:39:05Z) - Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge [24.966911190845817]
This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture.
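The adapter in an encoder-adapter-LLM stack commonly reduces the temporal resolution of encoder features before the LLM consumes them. Below is a hedged sketch of one common design (frame stacking plus a linear projection); whether Triple X uses this exact adapter is not stated in the abstract, and the stacking factor and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class StackingAdapter(nn.Module):
    """Downsample encoder frames by stacking, then project to LLM width."""

    def __init__(self, enc_dim=1280, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Linear(enc_dim * stack, llm_dim)

    def forward(self, feats):                  # (B, T, enc_dim)
        b, t, d = feats.shape
        t = t - t % self.stack                 # drop ragged tail frames
        feats = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(feats)                # (B, T // stack, llm_dim)

adapter = StackingAdapter()
print(adapter(torch.randn(2, 99, 1280)).shape)  # torch.Size([2, 24, 4096])
```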
arXiv Detail & Related papers (2025-07-23T07:48:33Z) - SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge [3.9836024799656053]
Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The SHNU-mASR system achieves an overall character/word error rate (CER/WER) of 11.76% on the blind evaluation set of the INTERSPEECH 2025 MLC-SLM Challenge.
arXiv Detail & Related papers (2025-07-04T07:10:33Z) - NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 [24.056321452209666]
This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I). We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies.
arXiv Detail & Related papers (2025-06-16T10:28:27Z) - M-Prometheus: A Suite of Open Multilingual LLM Judges [64.22940792713713]
We introduce M-Prometheus, a suite of open-weight LLM judges that can provide both direct-assessment and pairwise-comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs.
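For intuition, pairwise-comparison judging amounts to prompting the judge model with an instruction and two candidate outputs and parsing its verdict. The template below is purely illustrative, not M-Prometheus's actual prompt format (see the paper or model card for that).

```python
# Hypothetical pairwise-judge prompt; wording is an assumption.
PAIRWISE_TEMPLATE = """You are an impartial judge of multilingual text.

Instruction: {instruction}
Response A: {response_a}
Response B: {response_b}

Compare the two responses for accuracy and fluency in the target
language, then answer with exactly "A" or "B"."""

prompt = PAIRWISE_TEMPLATE.format(
    instruction="Translate to French: 'The weather is nice today.'",
    response_a="Il fait beau aujourd'hui.",
    response_b="Le temps est gentil aujourd'hui.",
)
print(prompt)  # fed to the judge model; its reply is parsed as A or B
```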
arXiv Detail & Related papers (2025-04-07T11:37:26Z) - Zero-resource Speech Translation and Recognition with LLMs [38.11535502039386]
We propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM.
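Mapping audio representations into the LLM's token embedding space means the LLM can consume them alongside text embeddings. A minimal sketch of that mechanism follows; all modules here are tiny placeholders (the paper uses real pretrained components), and the dimensions and token ids are assumptions.

```python
import torch
import torch.nn as nn

enc_dim, llm_dim, vocab = 1024, 2048, 32000

speech_encoder = nn.Linear(80, enc_dim)          # stand-in, kept frozen
adapter = nn.Linear(enc_dim, llm_dim)            # the lightweight module
token_embedding = nn.Embedding(vocab, llm_dim)   # the LLM's embedding table

for p in speech_encoder.parameters():            # only the adapter trains
    p.requires_grad = False

audio = torch.randn(1, 120, 80)                  # e.g. log-mel frames
prompt_ids = torch.tensor([[1, 523, 1009]])      # hypothetical prompt tokens

audio_embeds = adapter(speech_encoder(audio))    # (1, 120, llm_dim)
text_embeds = token_embedding(prompt_ids)        # (1, 3, llm_dim)

# The LLM then consumes this mixed sequence, e.g. via
# model(inputs_embeds=...) in Hugging Face transformers.
inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 123, 2048])
```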
arXiv Detail & Related papers (2024-12-24T17:37:11Z) - Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding [27.499426765845705]
Code-switching automatic speech recognition (ASR) faces challenges due to the language confusion resulting from accents, auditory similarity, and seamless language switches. We adapt Whisper, a large-scale multilingual pre-trained speech recognition model, to CS from both the encoder and decoder sides.
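Whisper's decoder is conditioned on a language token, which is the hook that language-aware decoding builds on. A minimal sketch with the standard Hugging Face API is below; the checkpoint choice and the random audio are placeholders, and this shows only the vanilla language-forcing mechanism, not the paper's adaptation.

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Downloads a small public checkpoint on first run.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio = torch.randn(16000)  # stand-in for 1 s of 16 kHz speech
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

# Force the decoder's language token instead of letting Whisper guess,
# which is where code-switching systems typically intervene.
ids = model.generate(inputs.input_features, language="zh", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True))
```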
arXiv Detail & Related papers (2024-12-21T07:06:44Z) - Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining the three tasks of video-to-audio, audio-to-text, and text-to-audio generation.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method unifies the previously separate tasks of audio understanding, video-to-audio generation, and text-to-audio generation in a single model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
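For intuition, additive quantization represents each group of weights as a sum of one codeword from each of several codebooks. The sketch below uses a greedy residual variant for clarity; AQLM itself learns codes and codebooks jointly and adds further machinery, so treat this only as a conceptual toy.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_codes, n_books = 8, 256, 2

codebooks = rng.normal(size=(n_books, n_codes, dim))
weights = rng.normal(size=(1000, dim))          # groups of LLM weights

codes = np.zeros((len(weights), n_books), dtype=np.int64)
residual = weights.copy()
for m in range(n_books):
    # Greedily pick the nearest codeword for the current residual.
    dists = ((residual[:, None, :] - codebooks[m][None]) ** 2).sum(-1)
    codes[:, m] = dists.argmin(1)
    residual -= codebooks[m][codes[:, m]]

# Reconstruction = sum of the selected codewords across codebooks.
recon = sum(codebooks[m][codes[:, m]] for m in range(n_books))
err = np.mean((weights - recon) ** 2) / np.mean(weights ** 2)
print(f"relative reconstruction error: {err:.3f}")
```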
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
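A hedged sketch of multi-layer cross-attention fusion: at each encoder level, audio features attend to visual features and the result is added back residually, rather than fusing only once at the top. Layer counts, dimensions, and the placeholder blocks are assumptions, not the MLCA-AVSR implementation.

```python
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """One encoder level: self-attention on audio, then audio->visual
    cross-attention with a residual connection."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        audio = self.audio_block(audio)
        fused, _ = self.cross(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)   # residual fusion at this level

layers = nn.ModuleList(CrossModalLayer() for _ in range(3))
audio, visual = torch.randn(2, 80, 256), torch.randn(2, 20, 256)
for layer in layers:                      # fuse at every level, not just the last
    audio = layer(audio, visual)
print(audio.shape)  # torch.Size([2, 80, 256])
```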
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs [50.17767479660832]
Vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition the LLMs to 'understand' the image input.
We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware.
arXiv Detail & Related papers (2023-07-13T17:51:58Z) - ESPnet-ST IWSLT 2021 Offline Speech Translation System [56.83606198051871]
This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track.
This year we made various efforts on training data, architecture, and audio segmentation.
Our best E2E system combined all the techniques with model ensembling and achieved 31.4 BLEU.
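BLEU scores like the 31.4 above are corpus-level statistics. As a reference point, here is a minimal example with `sacrebleu`, the de-facto standard scorer (the sentences are toy data, not the IWSLT set):

```python
import sacrebleu

hypotheses = ["the cat sat on the mat", "he read the book"]
# One reference stream, aligned one-to-one with the hypotheses.
references = [["the cat sat on the mat", "he was reading the book"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```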
arXiv Detail & Related papers (2021-07-01T17:49:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences arising from its use.