Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies
- URL: http://arxiv.org/abs/2507.13875v1
- Date: Fri, 18 Jul 2025 12:54:41 GMT
- Title: Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies
- Authors: Carlos Mena, Pol Serra, Jacobo Romero, Abir Messaoudi, Jose Giraldo, Carme Armentano-Oller, Rodolfo Zevallos, Ivan Meza, Javier Hernando,
- Abstract summary: Code-switching (CS), the alternating use of two or more languages, challenges automatic speech recognition (ASR) due to scarce training data and linguistic similarities.<n>We improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens.<n>Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.
- Score: 9.224033819309708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code-switching (CS), the alternating use of two or more languages, challenges automatic speech recognition (ASR) due to scarce training data and linguistic similarities. The lack of dedicated CS datasets limits ASR performance, as most models rely on monolingual or mixed-language corpora that fail to reflect real-world CS patterns. This issue is critical in multilingual societies where CS occurs in informal and formal settings. A key example is Catalan-Spanish CS, widely used in media and parliamentary speeches. In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. We extract CS data from Catalan speech corpora and fine-tune OpenAI's Whisper models, making them available on Hugging Face. Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.
Related papers
- Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages [3.263178944046948]
Code-switching (CS) presents challenges for ASR due to scarce and costly transcribed data.<n>We propose a phrase-level mixing method to generate synthetic CS data that mimics natural patterns.
arXiv Detail & Related papers (2025-06-17T04:37:16Z) - CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.<n>Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens.<n>We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z) - OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification [44.94458898538114]
We propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC)
It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID)
Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 24% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference.
arXiv Detail & Related papers (2024-02-20T02:04:38Z) - Multilingual self-supervised speech representations improve the speech
recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data finetuning self-supervised representations is a better performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z) - Speech collage: code-switched audio generation by collaging monolingual
corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z) - Unified model for code-switching speech recognition and language
identification based on a concatenated tokenizer [17.700515986659063]
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
This paper proposes a new method for creating code-switching ASR datasets from purely monolingual data sources.
A novel Concatenated Tokenizer enables ASR models to generate language ID for each emitted text token while reusing existing monolingual tokenizers.
arXiv Detail & Related papers (2023-06-14T21:24:11Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech
Recognition [62.997667081978825]
Code-Switching (CS) is referred to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are transcribed.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5,03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Code-Switching Text Augmentation for Multilingual Speech Processing [36.302629721413155]
Code-switching in spoken content has enforced ASR systems to handle mixed input.
Recent ASR studies showed the predominance of E2E-ASR using multilingual data to handle CS phenomena.
We propose a methodology to augment the monolingual data for artificially generating spoken CS text to improve different speech modules.
arXiv Detail & Related papers (2022-01-07T17:14:19Z) - Mandarin-English Code-switching Speech Recognition with Self-supervised
Speech Representation Models [55.82292352607321]
Code-switching (CS) is common in daily conversations where more than one language is used within a sentence.
This paper uses the recently successful self-supervised learning (SSL) methods to leverage many unlabeled speech data without CS.
arXiv Detail & Related papers (2021-10-07T14:43:35Z) - Style Variation as a Vantage Point for Code-Switching [54.34370423151014]
Code-Switching (CS) is a common phenomenon observed in several bilingual and multilingual communities.
We present a novel vantage point of CS to be style variations between both the participating languages.
We propose a two-stage generative adversarial training approach where the first stage generates competitive negative examples for CS and the second stage generates more realistic CS sentences.
arXiv Detail & Related papers (2020-05-01T15:53:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.