Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries
- URL: http://arxiv.org/abs/2601.18899v2
- Date: Mon, 02 Feb 2026 18:02:52 GMT
- Title: Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries
- Authors: Yuchen Zhang, Ravi Shekhar, Haralambos Mouratidis
- Abstract summary: Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources. We propose an efficient and novel connector-sharing strategy based on linguistic family membership.
- Score: 5.770962296305264
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.
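The architecture described in the abstract is easy to sketch. The following PyTorch-style snippet is a minimal illustration under assumed names and dimensions (the family map, `enc_dim`, and `llm_dim` are placeholders, not the authors' implementation): a frozen speech encoder feeds one small trainable connector per language family, and the connector output conditions a frozen LLM.

```python
import torch.nn as nn

# Illustrative family map; the paper's actual grouping may differ.
LANG_TO_FAMILY = {
    "es": "romance", "it": "romance", "fr": "romance",
    "de": "germanic", "nl": "germanic", "sv": "germanic",
}

class FamilySharedASR(nn.Module):
    """One lightweight trainable connector per language family; encoder and LLM stay frozen."""

    def __init__(self, speech_encoder, llm, enc_dim=1024, llm_dim=4096):
        super().__init__()
        self.speech_encoder = speech_encoder
        self.llm = llm
        for p in self.speech_encoder.parameters():
            p.requires_grad = False          # frozen speech encoder
        for p in self.llm.parameters():
            p.requires_grad = False          # frozen pretrained LLM
        families = sorted(set(LANG_TO_FAMILY.values()))
        # Only these connectors are trained: one per family, not one per language.
        self.connectors = nn.ModuleDict({
            fam: nn.Sequential(
                nn.Linear(enc_dim, llm_dim), nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
            for fam in families
        })

    def forward(self, speech, lang):
        feats = self.speech_encoder(speech)                     # (B, T, enc_dim)
        prefix = self.connectors[LANG_TO_FAMILY[lang]](feats)   # (B, T, llm_dim)
        # Assumes an HF-style decoder that accepts precomputed embeddings.
        return self.llm(inputs_embeds=prefix)
```

With F families instead of N languages, the trainable footprint shrinks from N connectors to F, which is the source of the parameter savings claimed in the abstract.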
Related papers
- Multimodal In-context Learning for ASR of Low-resource Languages [16.078416187950207]
In-context learning (ICL) with large language models (LLMs) offers a route to ASR for low-resource languages.
This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL).
Cross-lingual transfer learning improves MICL efficiency on target languages without training on them.
arXiv Detail & Related papers (2026-01-09T10:52:23Z) - A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR [15.703835740288504]
- A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR [15.703835740288504]
Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs.
We propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation.
arXiv Detail & Related papers (2026-01-02T04:08:39Z) - PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs [58.2469845374385]
- PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs [58.2469845374385]
We introduce Progressive Alignment Representation Training (PART), a multi-stage and multi-task framework that separates within-language from cross-language alignment.
Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches.
arXiv Detail & Related papers (2025-09-24T03:54:14Z) - Exploring Polyglot Harmony: On Multilingual Data Allocation for Large Language Models Pretraining [16.590296049892576]
- Exploring Polyglot Harmony: On Multilingual Data Allocation for Large Language Models Pretraining [16.590296049892576]
This paper introduces Climb, a novel framework designed to systematically optimize multilingual data allocation.
At its core, Climb introduces a cross-lingual interaction-aware language ratio, explicitly quantifying each language's effective allocation by capturing inter-language dependencies.
Extensive experiments confirm that Climb can accurately measure cross-lingual interactions across various multilingual settings.
arXiv Detail & Related papers (2025-09-19T03:34:34Z) - Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
- Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.
However, Whisper struggles with unseen languages, those not included in its pre-training.
We propose methods that exploit relationships among languages to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z) - LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [72.02635550088546]
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [72.02635550088546]
This work explores how large language models (LLMs) can enhance CLIP's capability, especially for processing longer and more complex image captions.
We introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs.
Our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance.
arXiv Detail & Related papers (2024-11-07T18:59:16Z) - Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM [1.3089936156875277]
- Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM [1.3089936156875277]
We introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector.
We propose an Insertion and Deletion of Interruption Token (IDIT) mechanism to better transfer the text generation ability of the LLM to the speech recognition task.
We also present a connector with an MoE architecture that manages multiple languages efficiently.
arXiv Detail & Related papers (2024-09-24T09:20:22Z) - Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora [13.891322931352649]
- Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora [13.891322931352649]
We propose a Code-Switched Large Language Model (CS-LLM) to enhance the code-switched text-to-speech synthesis (CS TTS) capability.
Specifically, we begin by enhancing the multilingual speech processing ability of LLMs through multilingual speech recognition and synthesis tasks.
We develop an effective code-switched (CS) data construction strategy that splits and concatenates words from different monolingual speech corpora to equip LLMs with improved CS TTS ability.
arXiv Detail & Related papers (2024-09-17T08:11:07Z) - Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters [3.7273829129985305]
- Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters [3.7273829129985305]
This paper explores the integration of graph knowledge from linguistic ontologies into multilingual Large Language Models (LLMs).
We employ language-specific adapters to improve performance for low-resource languages (LRLs) in sentiment analysis (SA) and named entity recognition (NER).
We assess how structured graph knowledge affects the performance of multilingual LLMs for LRLs in SA and NER.
arXiv Detail & Related papers (2024-07-01T15:56:24Z) - Enhancing Multilingual Capabilities of Large Language Models through
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
Their performance in most languages still lags behind that in a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z) - Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
arXiv Detail & Related papers (2023-06-02T23:04:19Z) - Efficiently Aligned Cross-Lingual Transfer Learning for Conversational
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel and large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.