Computational Approaches to Arabic-English Code-Switching
- URL: http://arxiv.org/abs/2410.13318v1
- Date: Thu, 17 Oct 2024 08:20:29 GMT
- Title: Computational Approaches to Arabic-English Code-Switching
- Authors: Caroline Sabty
- Abstract summary: We propose and apply state-of-the-art techniques for Modern Standard Arabic and Arabic-English NER tasks.
We have created the first annotated CS Arabic-English corpus for the NER task.
All methods showed improvements in the performance of the NER taggers on CS data.
- Abstract: Natural Language Processing (NLP) provides the computational methods at the core of language processing, analysis, and generation, and of many daily applications, from automatic text correction to speech recognition. While significant research has focused on NLP tasks for English, far less attention has been given to Modern Standard Arabic and Dialectal Arabic. Globalization has also contributed to the rise of Code-Switching (CS), where speakers mix languages within conversations and even within individual words (intra-word CS). This is especially common in Arab countries, where people often switch between dialects or between a dialect and a foreign language they master. CS between Arabic and English is frequent in Egypt, particularly on social media, so a significant amount of code-switched content can be found online. Such data needs to be investigated and analyzed for several NLP tasks in order to tackle the challenges of this multilingual phenomenon alongside those of the Arabic language itself; several integral NLP tasks had never before been addressed on Arabic-English CS data. In this work, we focus on the Named Entity Recognition (NER) task and on auxiliary tasks, such as Language Identification, that help solve NER on CS data. We address this gap by proposing and applying state-of-the-art techniques for Modern Standard Arabic and Arabic-English NER, and we create the first annotated CS Arabic-English corpus for the NER task. We also apply two enhancement techniques, CS contextual embeddings and data augmentation, to improve the NER tagger on CS data; all methods improved the performance of the NER taggers on CS data. Finally, we propose several intra-word language identification approaches to determine the language type of mixed text and to identify whether it is a named entity.
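As a rough illustration of the intra-word language identification problem described in the abstract, the sketch below tags tokens by Unicode script. This is a heuristic baseline, not the approach proposed in this work, and the example sentence is invented:

```python
import unicodedata

def char_script(ch: str) -> str:
    """Classify one character as Arabic, Latin, or other via its Unicode name."""
    if not ch.isalpha():
        return "other"
    name = unicodedata.name(ch, "")
    if "ARABIC" in name:
        return "arabic"
    if "LATIN" in name:
        return "latin"
    return "other"

def token_language(token: str) -> str:
    """Tag a token as Arabic ('ar'), English ('en'), intra-word mixed, or other."""
    scripts = {char_script(ch) for ch in token} - {"other"}
    if scripts == {"arabic"}:
        return "ar"
    if scripts == {"latin"}:
        return "en"      # crude: assumes any Latin-script token is English
    if scripts == {"arabic", "latin"}:
        return "mixed"   # intra-word code-switching candidate
    return "other"

# Invented Egyptian Arabic-English CS sentence with an intra-word switch.
sentence = "هروح ال-meeting بتاع ال weekend ده online"
print([(tok, token_language(tok)) for tok in sentence.split()])
```

Note that such a heuristic mislabels romanized Arabic (Arabizi) as English, which is one reason learned language identification models are needed for CS data.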
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture, CoSTA, that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Code-Switched Language Identification is Harder Than You Think [69.63439391717691]
Code switching is a common phenomenon in written and spoken communication.
We look at language identification as applied to building CS corpora.
We make the task more realistic by scaling it to more languages.
We reformulate the task as a sentence-level multi-label tagging problem to make it more tractable (see the sketch after this entry).
arXiv Detail & Related papers (2024-02-02T15:38:47Z)
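A minimal sketch of that sentence-level multi-label formulation, using character n-gram features and one binary classifier per language (toy data and features, not the paper's setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus: each sentence is tagged with the set of languages it contains.
sentences = [
    "I will see you tomorrow",          # English only
    "ana raye7 el gam3a bokra",         # romanized Arabic only
    "el meeting at2agel to next week",  # Arabic + English
    "see you fel weekend",              # Arabic + English
]
labels = [["en"], ["ar"], ["ar", "en"], ["ar", "en"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary indicator column per language

# Character n-grams plus one-vs-rest logistic regression: each language
# label is predicted independently, so a sentence can receive several.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(sentences, Y)

pred = clf.predict(["yalla let's go to el cinema"])
print(mlb.inverse_transform(pred))  # predictions on toy data will be noisy
```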
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training [0.34998703934432673]
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP), where suitable datasets are scarce.
To overcome this limitation, we create a dedicated dataset from publicly available resources.
We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches.
arXiv Detail & Related papers (2023-07-27T07:40:11Z)
- Enhancing Low Resource NER Using Assisting Language And Transfer Learning [0.7340017786387767]
We use base BERT, ALBERT, and RoBERTa to train a supervised NER model.
We show that models trained on multiple languages perform better than models trained on a single language.
arXiv Detail & Related papers (2023-06-10T16:31:04Z)
- Simple yet Effective Code-Switching Language Identification with Multitask Pre-Training and Transfer Learning [0.7242530499990028]
Code-switching is the linguistic phenomenon in which multilingual speakers, in casual settings, mix words from different languages in one utterance.
We propose two novel approaches toward improving language identification accuracy on an English-Mandarin child-directed speech dataset.
Our best model achieves a balanced accuracy of 0.781 on a real English-Mandarin code-switching child-directed speech corpus and outperforms the previous baseline by 55.3% (the metric is sketched after this entry).
arXiv Detail & Related papers (2023-05-31T11:43:16Z)
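For context on the figure quoted above: balanced accuracy is the unweighted mean of per-class recall, so majority-class guessing scores no better than chance even on skewed label distributions. A tiny check with invented labels:

```python
from sklearn.metrics import balanced_accuracy_score

# Invented gold and predicted sentence-level language labels.
y_true = ["en", "en", "en", "en", "zh", "zh"]
y_pred = ["en", "en", "en", "en", "zh", "en"]

# Plain accuracy would be 5/6; balanced accuracy averages per-class
# recall: (4/4 + 1/2) / 2 = 0.75, penalizing the missed minority class.
print(balanced_accuracy_score(y_true, y_pred))  # 0.75
```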
- Code-Switching without Switching: Language Agnostic End-to-End Speech Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning (a generic distillation sketch follows this entry).
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
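The distillation step can be pictured with the standard temperature-scaled soft-label objective below; this is a generic sketch, not the paper's reinforced iterative variant:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    distributions, the usual soft-label knowledge-distillation loss."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperature choices.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Toy usage: 4 tokens, each with logits over 5 NER tags.
student = torch.randn(4, 5, requires_grad=True)
teacher = torch.randn(4, 5)
loss = distillation_loss(student, teacher)
loss.backward()  # gradients flow to the student only
print(float(loss))
```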
- HIT: A Hierarchically Fused Deep Attention Network for Robust Code-mixed Language Representation [18.136640008855117]
We propose HIT, a robust representation learning method for code-mixed texts.
HIT is a hierarchical transformer-based framework that captures the semantic relationship among words.
Our evaluation of HIT on one European (Spanish) and five Indic (Hindi, Bengali, Tamil, Telugu, and Malayalam) languages suggests significant performance improvement against various state-of-the-art systems.
arXiv Detail & Related papers (2021-05-30T18:53:33Z)
- LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation [13.947879344871442]
We propose a benchmark for Linguistic Code-switching Evaluation (LinCE).
LinCE combines ten corpora covering four different code-switched language pairs.
We provide the scores of different popular models, including LSTM, ELMo, and multilingual BERT.
arXiv Detail & Related papers (2020-05-09T00:00:08Z)
- Style Variation as a Vantage Point for Code-Switching [54.34370423151014]
Code-Switching (CS) is a common phenomenon observed in several bilingual and multilingual communities.
We present a novel vantage point that views CS as style variation between the two participating languages.
We propose a two-stage generative adversarial training approach where the first stage generates competitive negative examples for CS and the second stage generates more realistic CS sentences.
arXiv Detail & Related papers (2020-05-01T15:53:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.