Code-Switching Text Augmentation for Multilingual Speech Processing
- URL: http://arxiv.org/abs/2201.02550v1
- Date: Fri, 7 Jan 2022 17:14:19 GMT
- Title: Code-Switching Text Augmentation for Multilingual Speech Processing
- Authors: Amir Hussein, Shammur Absar Chowdhury, Ahmed Abdelali, Najim Dehak,
Ahmed Ali
- Abstract summary: Code-switching in spoken content has forced ASR systems to handle mixed input.
Recent ASR studies show that E2E-ASR models trained on multilingual data predominate in handling the CS phenomenon.
We propose a methodology for augmenting monolingual data to artificially generate spoken CS text and improve different speech modules.
- Score: 36.302629721413155
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The pervasiveness of intra-utterance code-switching (CS) in spoken content
has forced ASR systems to handle mixed input. Yet, designing a CS-ASR system poses
many challenges, mainly due to data scarcity, grammatical structure complexity,
and a mismatched, unbalanced language-usage distribution.
Recent ASR studies show that E2E-ASR models trained on multilingual data
can handle CS phenomena with little CS data; however, the dependency on CS
data remains. In this work, we propose a methodology for augmenting
monolingual data to artificially generate spoken CS text and improve
different speech modules. We base our approach on Equivalence Constraint
theory, exploiting aligned translation pairs to generate grammatically
valid CS content. Our empirical results show a relative gain of 29-34% in
perplexity and around 2% in WER on two ecological and noisy CS test sets.
Finally, the human evaluation suggests that 83.8% of the generated data is
acceptable to humans.
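The abstract's core idea, inserting target-language segments only at switch points that Equivalence Constraint theory permits, given word-aligned translation pairs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the alignment format (source index, target index), the monotonic-cut criterion used as an EC proxy, and all function names are assumptions.

```python
# Hypothetical sketch: generating code-switched (CS) candidates from one
# aligned translation pair under a simplified Equivalence Constraint.
# A switch point is allowed only where the alignment admits a clean
# monotonic cut, so the resulting mixed sentence stays grammatical in
# both languages (a rough proxy for the EC theory in the paper).

def valid_switch_points(alignment, src_len):
    """Return source positions i where every source word before i aligns
    strictly before every source word from i onward (a monotonic cut)."""
    points = []
    for i in range(1, src_len):
        left = {t for s, t in alignment if s < i}
        right = {t for s, t in alignment if s >= i}
        if left and right and max(left) < min(right):
            points.append(i)
    return points

def generate_cs(src_tokens, tgt_tokens, alignment):
    """For each valid cut, keep the source prefix and splice in the
    aligned target suffix, yielding one candidate CS sentence per cut."""
    candidates = []
    for i in valid_switch_points(alignment, len(src_tokens)):
        tgt_start = min(t for s, t in alignment if s >= i)
        candidates.append(src_tokens[:i] + tgt_tokens[tgt_start:])
    return candidates

# Toy English-Spanish pair with a word alignment (src idx -> tgt idx).
src = ["I", "want", "to", "read", "the", "book"]
tgt = ["quiero", "leer", "el", "libro"]  # "I want" -> "quiero"
align = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 3)]
for cand in generate_cs(src, tgt, align):
    print(" ".join(cand))
```

On this toy pair the cut criterion rejects positions inside many-to-one alignment groups (e.g. between "I" and "want", which both align to "quiero"), so only boundary-respecting mixes such as "I want to read the libro" are produced.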
Related papers
- Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data [21.240439045909724]
Code-switching (CS) remains a critical challenge in Natural Language Processing (NLP).
This paper presents a novel methodology for generating CS data using Large Language Models (LLMs).
We propose back-translating natural CS sentences into monolingual English and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS text.
arXiv Detail & Related papers (2025-02-18T15:04:13Z) - AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR [1.8533128809847572]
Intra-sentential code-switching is a significant challenge for Automatic Speech Recognition systems.
AdaCS is a normalization model that integrates an adaptive bias attention module into an encoder-decoder network.
Experiments show that AdaCS outperforms the previous state-of-the-art method on Vietnamese CS ASR normalization.
arXiv Detail & Related papers (2025-01-13T07:27:00Z) - ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings [4.68732641979009]
This paper examines the Code-Switching (CS) phenomenon where two languages intertwine within a single utterance.
We highlight that the current Equivalence Constraint (EC) theory for CS in other languages may only partially capture English-Korean CS complexities.
We introduce a novel Koglish dataset tailored for English-Korean CS scenarios to mitigate such challenges.
arXiv Detail & Related papers (2024-08-28T11:27:21Z) - Generative error correction for code-switching speech recognition using
large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z) - Speech collage: code-switched audio generation by collaging monolingual
corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech
Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels from different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training, by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Investigating Lexical Replacements for Arabic-English Code-Switched Data
Augmentation [32.885722714728765]
We investigate data augmentation techniques for code-switching (CS) NLP systems.
We perform lexical replacements using word-aligned parallel corpora.
We compare these approaches against dictionary-based replacements.
arXiv Detail & Related papers (2022-05-25T10:44:36Z) - Reducing language context confusion for end-to-end code-switching
automatic speech recognition [50.89821865949395]
We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model.
By calculating the respective attention of multiple languages, our method can efficiently transfer language knowledge from rich monolingual data.
arXiv Detail & Related papers (2022-01-28T14:39:29Z) - Style Variation as a Vantage Point for Code-Switching [54.34370423151014]
Code-Switching (CS) is a common phenomenon observed in several bilingual and multilingual communities.
We present a novel vantage point that treats CS as style variation between the two participating languages.
We propose a two-stage generative adversarial training approach where the first stage generates competitive negative examples for CS and the second stage generates more realistic CS sentences.
arXiv Detail & Related papers (2020-05-01T15:53:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.