Word Level Language Identification in English Telugu Code Mixed Data
- URL: http://arxiv.org/abs/2010.04482v1
- Date: Fri, 9 Oct 2020 10:15:06 GMT
- Title: Word Level Language Identification in English Telugu Code Mixed Data
- Authors: Sunil Gundapu, Radhika Mamidi
- Abstract summary: Intra-sentential Code Switching (ICS) or Code Mixing (CM) is frequently observed nowadays.
We present a study of various models - Naive Bayes, Random Forest, Conditional Random Field (CRF), and Hidden Markov Model (HMM) - for Language Identification.
Our best performing system is CRF-based with an F1-score of 0.91.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In a multilingual or sociolingual configuration, Intra-sentential Code Switching (ICS) or Code Mixing (CM) is frequently observed nowadays. Most people in the world know more than one language, and CM usage is especially apparent on social media platforms. Moreover, ICS is particularly significant in the context of technology, health, and law, where conveying recent developments in one's native language is difficult. CM and Code Switching pose serious challenges in applications like dialog systems, machine translation, semantic parsing, and shallow parsing. Language Identification is the necessary first step toward any further advancement on code-mixed data. In this paper, we present a study of various models - Naive Bayes Classifier, Random Forest Classifier, Conditional Random Field (CRF), and Hidden Markov Model (HMM) - for Language Identification in English-Telugu Code Mixed Data. Considering the paucity of resources in code-mixed languages, we propose the CRF and HMM models for word-level language identification. Our best performing system is CRF-based with an F1-score of 0.91.
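Since the CRF is the paper's best-performing model, below is a minimal sketch of word-level language identification framed as CRF sequence labeling, assuming the sklearn-crfsuite library; the orthographic features, toy sentences, and the 'en'/'te' tag set are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: word-level language ID for English-Telugu code-mixed text
# as CRF sequence labeling, assuming sklearn-crfsuite. The features and
# tag set are illustrative, not the paper's exact setup.
import sklearn_crfsuite

def word_features(sent, i):
    """Orthographic features for the i-th token, plus a little context."""
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "length": len(word),
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(sent) - 1:
        feats["next.lower"] = sent[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats

def sent2features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Toy data: romanized Telugu mixed with English (labels assumed).
train_sents = [["nenu", "office", "ki", "veltunnanu"],
               ["movie", "chala", "bagundi"]]
train_tags = [["te", "en", "te", "te"],
              ["en", "te", "te"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs", c1=0.1, c2=0.1,
    max_iterations=100, all_possible_transitions=True,
)
crf.fit([sent2features(s) for s in train_sents], train_tags)
print(crf.predict([sent2features(["idi", "easy", "kadha"])]))
```

An HMM baseline would replace these hand-engineered feature dictionaries with emission and transition probabilities estimated from the same word/tag sequences.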
Related papers
- Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels.
This paper introduces a prompt based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages.
In this work, we leveraged GPT-3.5 Turbo to understand whether large language models are able to classify words into the correct categories.
arXiv Detail & Related papers (2024-11-06T16:20:37Z)
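As a rough illustration of the prompt-based approach in the entry above, here is a minimal sketch using the OpenAI Python client; the prompt wording, label set, and the English-Telugu example sentence are assumptions for illustration, not the shared task's actual prompt or categories.

```python
# Hedged sketch of prompt-based word-level language identification with
# GPT-3.5 Turbo via the OpenAI Python client. The prompt text and labels
# are illustrative assumptions, not the paper's actual configuration.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Label each word of the following code-mixed sentence with its "
    "language: 'tel' for Telugu, 'eng' for English, 'other' otherwise. "
    "Answer as one word/label pair per line.\n\nSentence: {sentence}"
)

def label_words(sentence: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(sentence=sentence)}],
        temperature=0,  # favor deterministic labeling
    )
    return response.choices[0].message.content

print(label_words("nenu tomorrow movie ki veltunnanu"))
```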
- Exploring Multi-Lingual Bias of Large Code Models in Code Generation
Code generation aims to synthesize code and fulfill functional requirements based on natural language (NL) specifications.
Despite their effectiveness, we observe a noticeable multilingual bias in the generation performance of large code models (LCMs).
LCMs demonstrate proficiency in generating solutions when provided with instructions in English, yet may falter when faced with semantically equivalent instructions in other NLs such as Chinese.
arXiv Detail & Related papers (2024-04-30T08:51:49Z)
- Marathi-English Code-mixed Text Generation
Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings.
This research introduces a Marathi-English code-mixed text generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics.
arXiv Detail & Related papers (2023-09-28T06:51:26Z)
- Leveraging Language Identification to Enhance Code-Mixed Text Classification
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- Simple yet Effective Code-Switching Language Identification with Multitask Pre-Training and Transfer Learning
Code-switching is the linguistic phenomenon in which, in casual settings, multilingual speakers mix words from different languages within one utterance.
We propose two novel approaches toward improving language identification accuracy on an English-Mandarin child-directed speech dataset.
Our best model achieves a balanced accuracy of 0.781 on a real English-Mandarin code-switching child-directed speech corpus and outperforms the previous baseline by 55.3%.
arXiv Detail & Related papers (2023-05-31T11:43:16Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
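To make the token-classification framing of the entry above concrete, here is a minimal sketch using the Hugging Face transformers library; the mBERT checkpoint and the label set are illustrative assumptions, and the classification head is untrained, so it would need fine-tuning on word-level language identification data before producing meaningful labels.

```python
# Hedged sketch: word-level language ID as token classification with a
# pretrained transformer (Hugging Face transformers). The checkpoint and
# label set are assumptions; the head is untrained and needs fine-tuning.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["kn", "en", "mixed", "other"]  # assumed tag inventory
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

words = ["naanu", "tomorrow", "barutteene"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, num_subtokens, num_labels)

# Map sub-token predictions back to whole words via word_ids().
pred_ids = logits.argmax(-1)[0].tolist()
word_ids = enc.word_ids()
for i, word in enumerate(words):
    first_sub = word_ids.index(i)  # first sub-token of word i
    print(word, labels[pred_ids[first_sub]])
```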
- LAE: Language-Aware Encoder for Monolingual and Multilingual ASR
A novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between languages at the frame level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z)
- Reducing language context confusion for end-to-end code-switching automatic speech recognition
We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model.
By calculating the respective attention of multiple languages, our method can efficiently transfer language knowledge from rich monolingual data.
arXiv Detail & Related papers (2022-01-28T14:39:29Z)
- PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages
We present our initial observations on applying switching point based positional encoding techniques to CM languages.
Results are only marginally better than SOTA, but it is evident that positional encoding could be an effective way to train position-sensitive language models for CM text.
arXiv Detail & Related papers (2021-11-12T08:18:21Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)