Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation
- URL: http://arxiv.org/abs/2205.12649v2
- Date: Tue, 4 Apr 2023 17:16:59 GMT
- Title: Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation
- Authors: Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu
- Abstract summary: We investigate data augmentation techniques for code-switching (CS) NLP systems.
We perform lexical replacements using word-aligned parallel corpora, with CS points chosen randomly or learnt by a model.
We compare these approaches against dictionary-based replacements.
- Score: 32.885722714728765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data sparsity is a major problem hindering the development of code-switching
(CS) NLP systems. In this paper, we investigate data augmentation techniques
for synthesizing dialectal Arabic-English CS text. We perform lexical
replacements using word-aligned parallel corpora where CS points are either
randomly chosen or learnt using a sequence-to-sequence model. We compare these
approaches against dictionary-based replacements. We assess the quality of the
generated sentences through human evaluation and evaluate the effectiveness of
data augmentation on machine translation (MT), automatic speech recognition
(ASR), and speech translation (ST) tasks. Results show that using a predictive
model yields more natural CS sentences than the random approach, as reported in
human judgements. On the downstream tasks, despite the random approach
generating more data, both approaches perform equally well (outperforming
dictionary-based replacements). Overall, data augmentation achieves a 34%
improvement in perplexity, a 5.2% relative improvement in WER on the ASR task,
+4.0-5.1 BLEU points on the MT task, and +2.1-2.2 BLEU points on the ST task,
over a baseline trained on the available data without augmentation.
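To make the replacement procedure concrete, below is a minimal Python sketch of the random variant, assuming word alignments are given as (source index, target index) pairs from an automatic aligner; all names are illustrative rather than the authors' code. The predictive variant would swap the random draw for a sequence-to-sequence model's switch decision at each position.

```python
import random

def random_lexical_replacement(src_tokens, tgt_tokens, alignments, p=0.3, seed=None):
    """Synthesize a code-switched sentence from an aligned sentence pair.

    src_tokens:  source (e.g. dialectal Arabic) sentence as a token list.
    tgt_tokens:  its English translation as a token list.
    alignments:  (src_idx, tgt_idx) word-alignment pairs, e.g. produced by
                 fast_align or GIZA++.
    p:           probability of switching at each alignable position.
    """
    rng = random.Random(seed)
    # Map each source position to its aligned target positions.
    aligned = {}
    for s, t in alignments:
        aligned.setdefault(s, []).append(t)

    cs_tokens = []
    for i, token in enumerate(src_tokens):
        # A CS point is drawn at random here; the predictive variant would
        # instead query a sequence-to-sequence model at each position.
        if i in aligned and rng.random() < p:
            # One-to-many alignments are inserted in target order; this
            # sketch does not deduplicate overlapping replacements.
            cs_tokens.extend(tgt_tokens[t] for t in sorted(aligned[i]))
        else:
            cs_tokens.append(token)
    return cs_tokens

# Toy aligned pair (tokens and alignment indices are illustrative):
src = ["ana", "bahebb", "el-barmaga"]   # roughly: "I love programming"
tgt = ["I", "love", "programming"]
print(" ".join(random_lexical_replacement(src, tgt, [(0, 0), (1, 1), (2, 2)], p=0.5, seed=0)))
```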
Related papers
- Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study [37.542853327876074]
We compare three popular approaches: lexical replacements, linguistic theories, and back-translation (BT).
We show that BT and CSW-predictive lexical replacement, both trained on CSW parallel data, perform best on both tasks.
arXiv Detail & Related papers (2023-10-23T18:09:41Z)
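Of the three approaches compared, back-translation is the most self-contained to sketch: monolingual English text is translated into synthetic code-switched sources by a reverse-direction model trained on the CSW parallel data. A hedged sketch follows; the checkpoint name is hypothetical, while the transformers calls are standard.

```python
# Back-translation sketch: synthesize CS source sentences by translating
# monolingual English text with a reverse (English -> code-switched) model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint: any seq2seq model fine-tuned on CSW parallel
# data in the reverse direction would play this role.
CHECKPOINT = "your-org/english-to-arabic-english-cs"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

def back_translate(english_sentences):
    """Return synthetic code-switched sources for monolingual English text."""
    batch = tokenizer(english_sentences, return_tensors="pt", padding=True)
    outputs = model.generate(**batch, max_new_tokens=64)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# The (synthetic CS, original English) pairs then augment MT training data.
```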
- Generative error correction for code-switching speech recognition using large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and the lists of hypotheses generated by an ASR system to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z)
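A minimal illustration of the hypothesis-list idea: the ASR N-best list is folded into a correction prompt for an LLM. The prompt template is illustrative, not the paper's exact format, and the final LLM call is left abstract.

```python
def build_correction_prompt(hypotheses):
    """Build an error-correction prompt from an ASR N-best list.

    The LLM sees several noisy candidate transcripts of one code-switched
    utterance and must emit a single corrected transcript.
    """
    lines = [
        "Below are candidate transcripts of one code-switched utterance.",
        "They may contain recognition errors. Output the single most",
        "plausible transcript, keeping each word in its original language.",
        "",
    ]
    lines += [f"{i + 1}. {hyp}" for i, hyp in enumerate(hypotheses)]
    lines.append("Corrected transcript:")
    return "\n".join(lines)

# Toy N-best list (hypothetical Arabic-English CS utterance):
nbest = [
    "ana bahebb el programming",
    "ana bahebb the programming",
    "ana behub el programming",
]
prompt = build_correction_prompt(nbest)
# `complete` stands in for any LLM text-completion call (API or local model):
# corrected = complete(prompt)
```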
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
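A toy sketch of the splicing step, assuming word- or phrase-level clips have already been cut from monolingual corpora in the order of a synthetic CS transcript. The short linear crossfade is one simple way to soften splice points, not necessarily the paper's choice.

```python
import numpy as np

def collage_segments(segments, sr=16000, fade_ms=10):
    """Splice audio clips into one utterance with short linear crossfades.

    segments: 1-D float arrays, each a word/phrase clip cut from a
              monolingual corpus, ordered by a synthetic CS transcript.
    """
    fade = int(sr * fade_ms / 1000)
    out = segments[0].astype(np.float32)
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        if fade and len(out) >= fade and len(seg) >= fade:
            # Linear crossfade over the splice point to soften the junction.
            ramp = np.linspace(0.0, 1.0, fade, dtype=np.float32)
            out[-fade:] = out[-fade:] * (1.0 - ramp) + seg[:fade] * ramp
            seg = seg[fade:]
        out = np.concatenate([out, seg])
    return out

# Toy example: three 0.3 s noise "clips" standing in for real segments.
clips = [np.random.randn(4800) for _ in range(3)]
utterance = collage_segments(clips)
```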
- WADER at SemEval-2023 Task 9: A Weak-labelling framework for Data augmentation in tExt Regression Tasks [4.102007186133394]
In this paper, we propose a novel weak-labeling strategy for data augmentation in text regression tasks called WADER.
We benchmark the performance of state-of-the-art pre-trained multilingual language models using WADER and analyze the use of sampling techniques to mitigate bias in data.
arXiv Detail & Related papers (2023-03-05T19:45:42Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Code-Switching Text Augmentation for Multilingual Speech Processing [36.302629721413155]
Code-switching in spoken content has forced ASR systems to handle mixed input.
Recent ASR studies show the predominance of end-to-end ASR (E2E-ASR) trained on multilingual data for handling CS phenomena.
We propose a methodology for augmenting monolingual data to artificially generate spoken CS text and improve different speech modules.
arXiv Detail & Related papers (2022-01-07T17:14:19Z)
- Low Resource German ASR with Untranscribed Data Spoken by Non-native Children -- INTERSPEECH 2021 Shared Task SPAPL System [19.435571932141364]
This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in German.
5 hours of transcribed data and 60 hours of untranscribed data are provided to develop a German ASR system for children.
For training on the transcribed data, we propose a non-speech state discriminative loss (NSDL) to mitigate the influence of long-duration non-speech segments within speech utterances.
Our system achieves a word error rate (WER) of 39.68% on the evaluation data.
arXiv Detail & Related papers (2021-06-18T07:36:26Z)
- Consistency Regularization for Cross-Lingual Fine-Tuning [61.08704789561351]
We propose to improve cross-lingual fine-tuning with consistency regularization.
Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations.
Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks.
arXiv Detail & Related papers (2021-06-15T15:35:44Z)
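A generic PyTorch rendering of example consistency regularization as summarized above: the model is penalized when its output distribution shifts between an example and its augmented version. It assumes a Hugging-Face-style classifier exposing `.logits` and is not the paper's exact objective.

```python
import torch.nn.functional as F

def consistency_loss(model, batch, augmented_batch):
    """Symmetrized KL between predictions on an example and its augmentation.

    `batch` and `augmented_batch` are model-ready inputs for the same
    examples before and after augmentation (e.g. translation into another
    language, token-level perturbation).
    """
    p = F.log_softmax(model(**batch).logits, dim=-1)            # log-probs, original
    q = F.log_softmax(model(**augmented_batch).logits, dim=-1)  # log-probs, augmented
    kl_pq = F.kl_div(q, p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(p, q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Typical use: total = task_loss + lambda_cr * consistency_loss(model, b, b_aug)
```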
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set a sentence-specific probability for word selection by considering each word's role in the sentence.
Our proposed method is evaluated on WMT14 English-to-German dataset and IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)
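A hedged sketch of the word-selection step: a dependency parse supplies each word's role, and here the selection probability is scaled by parse depth so that structurally central words are perturbed less often. The depth heuristic is illustrative; the paper's actual probability assignment may differ.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import random
import spacy

nlp = spacy.load("en_core_web_sm")

def depth(token):
    """Number of dependency arcs from the token up to the sentence root."""
    d = 0
    while token.head is not token:
        token = token.head
        d += 1
    return d

def syntax_aware_selection(sentence, base_p=0.15, seed=None):
    """Return indices of words chosen for perturbation (e.g. replacement).

    Deeper tokens in the dependency parse (further from the root, hence
    typically less structurally important) get a higher selection chance.
    """
    rng = random.Random(seed)
    doc = nlp(sentence)
    max_d = max(depth(t) for t in doc) or 1
    return [
        i for i, tok in enumerate(doc)
        if rng.random() < base_p * (1 + depth(tok) / max_d)
    ]

print(syntax_aware_selection("The quick brown fox jumps over the lazy dog.", seed=0))
```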
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.