Code-Switching Text Augmentation for Multilingual Speech Processing
- URL: http://arxiv.org/abs/2201.02550v1
- Date: Fri, 7 Jan 2022 17:14:19 GMT
- Title: Code-Switching Text Augmentation for Multilingual Speech Processing
- Authors: Amir Hussein, Shammur Absar Chowdhury, Ahmed Abdelali, Najim Dehak,
Ahmed Ali
- Abstract summary: Code-switching in spoken content has forced ASR systems to handle mixed input.
Recent ASR studies show that E2E-ASR models trained on multilingual data predominate in handling the CS phenomenon.
We propose a methodology for augmenting monolingual data to artificially generate spoken CS text and improve different speech modules.
- Score: 36.302629721413155
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The pervasiveness of intra-utterance code-switching (CS) in spoken content
has forced ASR systems to handle mixed input. Yet, designing a CS-ASR system poses
many challenges, mainly due to data scarcity, grammatical structure complexity,
and a mismatched, unbalanced language-usage distribution.
Recent ASR studies show that E2E-ASR models trained on multilingual data
can handle CS phenomena with little CS data; however, the dependency on CS
data remains. In this work, we propose a methodology for augmenting
monolingual data to artificially generate spoken CS text and improve
different speech modules. We base our approach on Equivalence Constraint
theory, exploiting aligned translation pairs to generate grammatically
valid CS content. Our empirical results show a relative gain of 29-34% in
perplexity and around 2% in WER on two ecological and noisy CS test sets.
Finally, the human evaluation suggests that 83.8% of the generated data is
acceptable to humans.
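The abstract's core idea, inserting target-language segments only at switch points that Equivalence Constraint theory permits, given word-aligned translation pairs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the alignment format (source index, target index), the monotonic-cut criterion used as an EC proxy, and all function names are assumptions.

```python
# Hypothetical sketch: generating code-switched (CS) candidates from one
# aligned translation pair under a simplified Equivalence Constraint.
# A switch point is allowed only where the alignment admits a clean
# monotonic cut, so the resulting mixed sentence stays grammatical in
# both languages (a rough proxy for the EC theory in the paper).

def valid_switch_points(alignment, src_len):
    """Return source positions i where every source word before i aligns
    strictly before every source word from i onward (a monotonic cut)."""
    points = []
    for i in range(1, src_len):
        left = {t for s, t in alignment if s < i}
        right = {t for s, t in alignment if s >= i}
        if left and right and max(left) < min(right):
            points.append(i)
    return points

def generate_cs(src_tokens, tgt_tokens, alignment):
    """For each valid cut, keep the source prefix and splice in the
    aligned target suffix, yielding one candidate CS sentence per cut."""
    candidates = []
    for i in valid_switch_points(alignment, len(src_tokens)):
        tgt_start = min(t for s, t in alignment if s >= i)
        candidates.append(src_tokens[:i] + tgt_tokens[tgt_start:])
    return candidates

# Toy English-Spanish pair with a word alignment (src idx -> tgt idx).
src = ["I", "want", "to", "read", "the", "book"]
tgt = ["quiero", "leer", "el", "libro"]  # "I want" -> "quiero"
align = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 3)]
for cand in generate_cs(src, tgt, align):
    print(" ".join(cand))
```

On this toy pair the cut criterion rejects positions inside many-to-one alignment groups (e.g. between "I" and "want", which both align to "quiero"), so only boundary-respecting mixes such as "I want to read the libro" are produced.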
Related papers
- Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data [21.240439045909724]
Code-switching (CS) remains a critical challenge in Natural Language Processing (NLP).
This paper presents a novel methodology for generating CS data using Large Language Models (LLMs).
We propose back-translating natural CS sentences into monolingual English and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS text.
arXiv Detail & Related papers (2025-02-18T15:04:13Z) - AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR [1.8533128809847572]
Intra-sentential code-switching is a significant challenge for Automatic Speech Recognition systems.
AdaCS is a normalization model that integrates an adaptive bias attention module into an encoder-decoder network.
Experiments show that AdaCS outperforms the previous state-of-the-art method on Vietnamese CS ASR normalization.
arXiv Detail & Related papers (2025-01-13T07:27:00Z) - ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings [4.68732641979009]
This paper examines the Code-Switching (CS) phenomenon where two languages intertwine within a single utterance.
We highlight that the current Equivalence Constraint (EC) theory for CS in other languages may only partially capture English-Korean CS complexities.
We introduce a novel Koglish dataset tailored for English-Korean CS scenarios to mitigate such challenges.
arXiv Detail & Related papers (2024-08-28T11:27:21Z) - Generative error correction for code-switching speech recognition using
large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z) - Speech collage: code-switched audio generation by collaging monolingual
corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech
Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels from different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training, by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Investigating Lexical Replacements for Arabic-English Code-Switched Data
Augmentation [32.885722714728765]
We investigate data augmentation techniques for code-switching (CS) NLP systems.
We perform lexical replacements using word-aligned parallel corpora.
We compare these approaches against dictionary-based replacements.
arXiv Detail & Related papers (2022-05-25T10:44:36Z) - Reducing language context confusion for end-to-end code-switching
automatic speech recognition [50.89821865949395]
We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model.
By calculating the respective attention of multiple languages, our method can efficiently transfer language knowledge from rich monolingual data.
arXiv Detail & Related papers (2022-01-28T14:39:29Z) - Style Variation as a Vantage Point for Code-Switching [54.34370423151014]
Code-Switching (CS) is a common phenomenon observed in several bilingual and multilingual communities.
We present a novel vantage point that treats CS as style variation between the two participating languages.
We propose a two-stage generative adversarial training approach where the first stage generates competitive negative examples for CS and the second stage generates more realistic CS sentences.
arXiv Detail & Related papers (2020-05-01T15:53:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.