Code-Switching Text Augmentation for Multilingual Speech Processing
- URL: http://arxiv.org/abs/2201.02550v1
- Date: Fri, 7 Jan 2022 17:14:19 GMT
- Title: Code-Switching Text Augmentation for Multilingual Speech Processing
- Authors: Amir Hussein, Shammur Absar Chowdhury, Ahmed Abdelali, Najim Dehak, Ahmed Ali
- Abstract summary: Code-switching in spoken content has forced ASR systems to handle mixed input.
Recent ASR studies showed the predominance of E2E-ASR using multilingual data to handle CS phenomena.
We propose a methodology to augment the monolingual data for artificially generating spoken CS text to improve different speech modules.
- Score: 36.302629721413155
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The pervasiveness of intra-utterance code-switching (CS) in spoken content
has forced ASR systems to handle mixed input. Yet designing a CS-ASR system poses
many challenges, mainly due to data scarcity, grammatical structure
complexity, and mismatch along with unbalanced language usage distribution.
Recent ASR studies showed the predominance of E2E-ASR using multilingual data
to handle CS phenomena with little CS data. However, the dependency on the CS
data still remains. In this work, we propose a methodology to augment the
monolingual data for artificially generating spoken CS text to improve
different speech modules. We base our approach on Equivalence Constraint
theory while exploiting aligned translation pairs to generate grammatically
valid CS content. Our empirical results show a relative gain of 29-34% in
perplexity and around 2% in WER on two ecological and noisy CS test sets.
Finally, the human evaluation suggests that 83.8% of the generated data is
acceptable to humans.
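The abstract names the ingredients (aligned translation pairs plus the Equivalence Constraint) but not the pipeline, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes word alignments are already available as index pairs, and it approximates the constraint with a simplified rule: a source span may be swapped for its aligned target span only if the two form a consistent phrase pair and the words on either side keep the same relative order in both languages. The function names, the `max_len` parameter, and the toy English-Spanish sentence (chosen only for readability) are all hypothetical.

```python
from typing import List, Optional, Tuple

# Word alignment as (source_index, target_index) pairs. Assumed to come
# from an external aligner; this sketch does not compute it.
Alignment = List[Tuple[int, int]]


def target_span(i: int, j: int, alignment: Alignment) -> Optional[Tuple[int, int]]:
    """Return the target span [a, b) that may replace source span [i, j).

    Simplified stand-in for the Equivalence Constraint: the spans must be
    a consistent phrase pair, and words outside the span must stay on the
    same side of the switch point in both languages. Returns None if the
    switch is not allowed.
    """
    inside = [t for s, t in alignment if i <= s < j]
    if not inside:
        return None  # nothing aligned, so nothing to switch to
    a, b = min(inside), max(inside) + 1
    for s, t in alignment:
        in_src = i <= s < j
        in_tgt = a <= t < b
        if in_src != in_tgt:
            return None  # an alignment link crosses the span boundary
        if s < i and t >= b:
            return None  # left context would end up on the right
        if s >= j and t < a:
            return None  # right context would end up on the left
    return a, b


def generate_cs(src: List[str], tgt: List[str], alignment: Alignment,
                max_len: int = 2) -> List[List[str]]:
    """Generate code-switched variants of src by swapping in target spans."""
    variants = []
    n = len(src)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            if i == 0 and j == n:
                continue  # replacing everything is translation, not CS
            span = target_span(i, j, alignment)
            if span is not None:
                a, b = span
                variants.append(src[:i] + tgt[a:b] + src[j:])
    return variants


# Toy example (hypothetical data): adjective-noun order differs between
# the two languages, so "red" or "car" cannot be switched alone.
src = ["I", "want", "the", "red", "car"]
tgt = ["quiero", "el", "coche", "rojo"]
alignment = [(0, 0), (1, 0), (2, 1), (3, 3), (4, 2)]

for sentence in generate_cs(src, tgt, alignment):
    print(" ".join(sentence))
```

In this toy case the word-order mismatch blocks switching "red" or "car" individually, while the whole noun phrase "red car" can be replaced by "coche rojo". A real system would also need an aligner, a sampling strategy over the candidate variants, and a much more careful treatment of the linguistic constraint.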
Related papers
- ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings [4.68732641979009]
This paper examines the Code-Switching (CS) phenomenon where two languages intertwine within a single utterance.
We highlight that the current Equivalence Constraint (EC) theory for CS in other languages may only partially capture English-Korean CS complexities.
We introduce a novel Koglish dataset tailored for English-Korean CS scenarios to mitigate such challenges.
arXiv Detail & Related papers (2024-08-28T11:27:21Z)
- Generative error correction for code-switching speech recognition using large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation [32.885722714728765]
We investigate data augmentation techniques for code-switching (CS) NLP systems.
We perform lexical replacements using word-aligned parallel corpora.
We compare these approaches against dictionary-based replacements.
arXiv Detail & Related papers (2022-05-25T10:44:36Z)
- Reducing language context confusion for end-to-end code-switching automatic speech recognition [50.89821865949395]
We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model.
By calculating the respective attention of multiple languages, our method can efficiently transfer language knowledge from rich monolingual data.
arXiv Detail & Related papers (2022-01-28T14:39:29Z)
- Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching [41.88097793717185]
Code-Switching (CS) is a common linguistic phenomenon in multilingual communities.
This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech.
arXiv Detail & Related papers (2021-12-19T17:31:15Z)
- KARI: KAnari/QCRI's End-to-End systems for the INTERSPEECH 2021 Indian Languages Code-Switching Challenge [7.711092265101041]
We present the Kanari/QCRI system and the modeling strategies used to participate in the Interspeech 2021 Code-switching (CS) challenge for low-resource Indian languages.
The subtask involved developing a speech recognition system for two CS datasets: Hindi-English and Bengali-English.
To tackle the CS challenges, we use transfer learning for incorporating the publicly available monolingual Hindi, Bengali, and English speech data.
arXiv Detail & Related papers (2021-06-10T16:12:51Z)
- Style Variation as a Vantage Point for Code-Switching [54.34370423151014]
Code-Switching (CS) is a common phenomenon observed in several bilingual and multilingual communities.
We present a novel vantage point that treats CS as style variation between the two participating languages.
We propose a two-stage generative adversarial training approach where the first stage generates competitive negative examples for CS and the second stage generates more realistic CS sentences.
arXiv Detail & Related papers (2020-05-01T15:53:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.