SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset
- URL: http://arxiv.org/abs/2506.00087v1
- Date: Fri, 30 May 2025 05:54:46 GMT
- Title: SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset
- Authors: Peng Xie, Xingyuan Liu, Tsz Wai Chan, Yequan Bie, Yangqiu Song, Yang Wang, Hao Chen, Kani Chen
- Abstract summary: Code-Switching (CS) is the alternating use of two or more languages within a conversation or utterance. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems. SwitchLingua is the first large-scale multilingual and multi-ethnic code-switching dataset.
- Score: 34.40254709148148
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce LinguaMaster, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate SwitchLingua, the first large-scale multilingual and multi-ethnic code-switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code-switching scenarios, we propose the Semantic-Aware Error Rate (SAER), a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance.
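As an illustration of the idea behind SAER, the sketch below blends a surface-level word error rate with an embedding-based semantic distance; the mixing weight, the choice of encoder, and the function names are assumptions for illustration only, since the abstract does not give the exact formulation.

```python
# Hypothetical sketch of a semantic-aware error rate (NOT the paper's exact SAER formula):
# blend a surface-level WER with an embedding-based semantic distance, so hypotheses that
# preserve meaning across a language switch are penalized less than ones that change it.

import jiwer                                                   # pip install jiwer
from sentence_transformers import SentenceTransformer, util   # pip install sentence-transformers

# Multilingual sentence encoder (an assumed choice, not specified by the paper).
_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_aware_error_rate(reference: str, hypothesis: str, alpha: float = 0.5) -> float:
    """Weighted blend of WER and (1 - cosine similarity) of sentence embeddings.
    `alpha` is an assumed mixing weight, not a value from the paper."""
    surface_wer = jiwer.wer(reference, hypothesis)
    ref_emb, hyp_emb = _encoder.encode([reference, hypothesis], convert_to_tensor=True)
    semantic_distance = 1.0 - util.cos_sim(ref_emb, hyp_emb).item()
    return alpha * surface_wer + (1.0 - alpha) * semantic_distance

# Example: a code-switched reference scored against a hypothesis that translates the switch.
print(semantic_aware_error_rate("I want to 吃饭 now", "I want to eat now"))
```

Under such a blend, a hypothesis that renders a switched word in the other language scores better than one that substitutes an unrelated word, even though both incur the same edit-distance penalty.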
Related papers
- CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval [0.9023847175654603]
CLASP (Contrastive Language-Speech Pretraining) is a multilingual representation tailored for audio-text information retrieval. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics.
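For reference, the retrieval metrics named above can be computed from the per-query rank of the correct item; the snippet below is a generic illustration, not code from CLASP.

```python
# Illustrative computation of HITS@1, MRR, and mean rank from 1-indexed ranks of the
# correct item per query; the example ranks are made up.

from typing import Sequence

def retrieval_metrics(ranks: Sequence[int]) -> dict:
    """`ranks[i]` is the 1-indexed position of the correct result for query i."""
    n = len(ranks)
    return {
        "HITS@1": sum(r == 1 for r in ranks) / n,   # fraction of queries ranked first
        "MRR":    sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank
        "meanR":  sum(ranks) / n,                   # mean rank (lower is better)
    }

print(retrieval_metrics([1, 3, 1, 2]))  # {'HITS@1': 0.5, 'MRR': 0.708..., 'meanR': 1.75}
```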
arXiv Detail & Related papers (2024-12-17T16:38:10Z)
- Developing a Multilingual Dataset and Evaluation Metrics for Code-Switching: A Focus on Hong Kong's Polylingual Dynamics [0.5700195008916903]
We develop a 34.8-hour dataset of Mixed Cantonese and English (MCE) audio using our Multi-Agent Data Generation Framework (MADGF). We fine-tuned the open-source multilingual Automatic Speech Recognition (ASR) model, Whisper, with the MCE dataset, leading to impressive zero-shot performance.
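As a hedged starting point, an off-the-shelf multilingual Whisper checkpoint can be run on code-switched audio through the Hugging Face pipeline API before any fine-tuning; the checkpoint name and audio path below are placeholders, not artifacts from this paper.

```python
# Minimal sketch: transcribe a (placeholder) code-switched recording with a stock
# multilingual Whisper checkpoint via Hugging Face transformers.

from transformers import pipeline  # pip install transformers

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("cantonese_english_sample.wav")  # placeholder audio file
print(result["text"])
```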
arXiv Detail & Related papers (2023-10-27T08:01:55Z)
- Simple yet Effective Code-Switching Language Identification with Multitask Pre-Training and Transfer Learning [0.7242530499990028]
Code-switching is the linguistic phenomenon in which, in casual settings, multilingual speakers mix words from different languages within one utterance.
We propose two novel approaches toward improving language identification accuracy on an English-Mandarin child-directed speech dataset.
Our best model achieves a balanced accuracy of 0.781 on a real English-Mandarin code-switching child-directed speech corpus and outperforms the previous baseline by 55.3%.
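Balanced accuracy, the metric reported above, is the unweighted mean of per-class recall, which avoids rewarding a model that simply favors the majority language; a quick illustration with made-up labels:

```python
# Illustration of balanced accuracy on invented language-identification labels.
from sklearn.metrics import balanced_accuracy_score  # pip install scikit-learn

y_true = ["en", "zh", "zh", "en", "zh", "zh"]
y_pred = ["en", "zh", "en", "en", "zh", "zh"]
print(balanced_accuracy_score(y_true, y_pred))  # recall_en=1.0, recall_zh=0.75 -> 0.875
```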
arXiv Detail & Related papers (2023-05-31T11:43:16Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between languages at the frame level.
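A toy sketch of the disentangling idea (not the paper's exact LAE design, and with assumed layer sizes): a shared frontend feeds two language-specific branches, and a frame-level language gate mixes their outputs.

```python
# Toy PyTorch sketch: shared frontend + two language-specific branches mixed by a
# frame-level language gate. Dimensions and gating scheme are illustrative assumptions.

import torch
import torch.nn as nn

class ToyLanguageAwareEncoder(nn.Module):
    def __init__(self, feat_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.shared = nn.Linear(feat_dim, hidden)
        self.branch_a = nn.Linear(hidden, hidden)   # e.g. Mandarin-specific branch
        self.branch_b = nn.Linear(hidden, hidden)   # e.g. English-specific branch
        self.lang_gate = nn.Linear(hidden, 2)       # frame-level language posterior

    def forward(self, feats: torch.Tensor):
        # feats: (batch, frames, feat_dim)
        h = torch.relu(self.shared(feats))
        gate = torch.softmax(self.lang_gate(h), dim=-1)                     # (batch, frames, 2)
        mixed = gate[..., :1] * self.branch_a(h) + gate[..., 1:] * self.branch_b(h)
        return mixed, gate                                                  # encoding + per-frame language weights

enc = ToyLanguageAwareEncoder()
out, lang_posterior = enc(torch.randn(2, 100, 80))
print(out.shape, lang_posterior.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 100, 2])
```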
arXiv Detail & Related papers (2022-06-05T04:03:12Z)
- Multi-level Contrastive Learning for Cross-lingual Spoken Language Understanding [90.87454350016121]
We develop novel code-switching schemes to generate hard negative examples for contrastive learning at all levels.
We develop a label-aware joint model to leverage label semantics for cross-lingual knowledge transfer.
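One common code-switching augmentation scheme, shown here as a hedged sketch, substitutes dictionary translations for a random subset of words to create harder training variants; the paper's actual schemes may differ, and the tiny lexicon below is invented for the example.

```python
# Hedged illustration of word-level code-switching augmentation for hard-negative /
# hard-variant generation. The bilingual lexicon is a made-up toy.

import random

TOY_LEXICON = {"weather": "天气", "today": "今天", "good": "好"}  # hypothetical en->zh entries

def code_switch_augment(sentence: str, swap_prob: float = 0.5, seed: int = 0) -> str:
    """Replace lexicon words with their translations with probability `swap_prob`."""
    rng = random.Random(seed)
    words = sentence.split()
    switched = [
        TOY_LEXICON[w.lower()] if w.lower() in TOY_LEXICON and rng.random() < swap_prob else w
        for w in words
    ]
    return " ".join(switched)

print(code_switch_augment("The weather is good today"))
```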
arXiv Detail & Related papers (2022-05-07T13:44:28Z)
- Code Switched and Code Mixed Speech Recognition for Indic languages [0.0]
Training multilingual automatic speech recognition (ASR) systems is challenging because acoustic and lexical information is typically language specific.
We compare the performance of an end-to-end multilingual speech recognition system to that of monolingual models conditioned on language identification (LID).
We also propose a similar technique for the code-switched setting and achieve WERs of 21.77 and 28.27 on Hindi-English and Bengali-English, respectively.
arXiv Detail & Related papers (2022-03-30T18:09:28Z)
- Reducing language context confusion for end-to-end code-switching automatic speech recognition [50.89821865949395]
We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model.
By calculating the respective attention of multiple languages, our method can efficiently transfer language knowledge from rich monolingual data.
arXiv Detail & Related papers (2022-01-28T14:39:29Z)
- Low-Resource Spoken Language Identification Using Self-Attentive Pooling and Deep 1D Time-Channel Separable Convolutions [0.0]
We show that a convolutional neural network with a self-attentive pooling layer achieves promising results in a low-resource setting for the language identification task.
We also substantiate the hypothesis that, whenever the dataset is diverse enough that other classification factors such as gender and age are well averaged out, the confusion matrix of the LID system reflects a measure of language similarity.
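A minimal PyTorch sketch of self-attentive pooling over frame-level features, as used on top of a convolutional encoder for utterance-level language identification; dimensions are placeholders, not the paper's configuration.

```python
# Self-attentive pooling: collapse a sequence of frame features into one utterance
# vector using learned attention weights. Sizes below are illustrative.

import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Collapse (batch, frames, dim) into (batch, dim) with learned attention weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.scorer(x), dim=1)   # (batch, frames, 1)
        return (weights * x).sum(dim=1)                  # attention-weighted average over frames

frames = torch.randn(4, 200, 128)                        # e.g. CNN output: 4 utterances, 200 frames
pooled = SelfAttentivePooling(128)(frames)
print(pooled.shape)                                      # torch.Size([4, 128])
```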
arXiv Detail & Related papers (2021-05-31T18:35:27Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)