CoLI-Machine Learning Approaches for Code-mixed Language Identification
at the Word Level in Kannada-English Texts
- URL: http://arxiv.org/abs/2211.09847v1
- Date: Thu, 17 Nov 2022 19:16:56 GMT
- Authors: H.L. Shashirekha and F. Balouchzahi and M.D. Anusha and G. Sidorov
- Abstract summary: Many Indians especially youths are comfortable with Hindi and English, in addition to their local languages. Hence, they often use more than one language to post their comments on social media.
Code-mixed Kn-En texts are extracted from YouTube video comments to construct CoLI-Kenglish dataset and code-mixed Kn-En embedding.
The words in the CoLI-Kenglish dataset are grouped into six major categories, namely, "Kannada", "English", "Mixed-language", "Name", "Location" and "Other".
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of automatically identifying a language used in a given text is
called Language Identification (LI). India is a multilingual country and many
Indians especially youths are comfortable with Hindi and English, in addition
to their local languages. Hence, they often use more than one language to post
their comments on social media. Texts containing more than one language are
called "code-mixed texts" and are a good source of input for LI. Languages in
these texts may be mixed at sentence level, word level or even at sub-word
level. LI at word level is a sequence labeling problem where each and every
word in a sentence is tagged with one of the languages in the predefined set of
languages. In order to address word level LI in code-mixed Kannada-English
(Kn-En) texts, this work presents i) the construction of code-mixed Kn-En
dataset called CoLI-Kenglish dataset, ii) code-mixed Kn-En embedding and iii)
learning models using Machine Learning (ML), Deep Learning (DL) and Transfer
Learning (TL) approaches. Code-mixed Kn-En texts are extracted from Kannada
YouTube video comments to construct CoLI-Kenglish dataset and code-mixed Kn-En
embedding. The words in CoLI-Kenglish dataset are grouped into six major
categories, namely, "Kannada", "English", "Mixed-language", "Name", "Location"
and "Other". The learning models, namely, CoLI-vectors and CoLI-ngrams based on
ML, CoLI-BiLSTM based on DL and CoLI-ULMFiT based on TL approaches are built
and evaluated using CoLI-Kenglish dataset. The performances of the learning
models illustrated the superiority of the CoLI-ngrams model over the other
models, with a macro average F1-score of 0.64. However, the results of all the
learning models were quite competitive with each other.
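The word-level tagging task described above can be sketched in miniature. The following is a minimal, hypothetical illustration of character n-gram based word-level language identification in the spirit of the CoLI-ngrams model, not the authors' actual implementation; the toy lexicon, two of the six categories, and all function names are assumptions for demonstration only.

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, a common LI feature."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Toy training lexicon (hypothetical; the real CoLI-Kenglish dataset
# has six categories, only two of which are sketched here).
TRAIN = {
    "English": ["hello", "video", "nice", "super", "comment"],
    "Kannada": ["chennagide", "thumba", "kelsa", "guru", "banni"],
}

# Build a per-label n-gram frequency profile from the lexicon.
profiles = {}
for label, words in TRAIN.items():
    counts = Counter()
    for w in words:
        counts.update(char_ngrams(w))
    profiles[label] = counts

def tag_word(word):
    """Assign the label whose n-gram profile overlaps the word most."""
    grams = char_ngrams(word)
    scores = {
        label: sum(counts[g] for g in grams)
        for label, counts in profiles.items()
    }
    return max(scores, key=scores.get)

def tag_sentence(sentence):
    """Word-level LI as sequence labeling: tag every token."""
    return [(w, tag_word(w)) for w in sentence.split()]
```

For a code-mixed comment such as "thumba super", `tag_sentence` tags each token independently, mirroring how word-level LI treats a sentence as a sequence of per-word classification decisions.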
Related papers
- Leveraging Language Identification to Enhance Code-Mixed Text
Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve BERT-based models performance on low-resource Code-Mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA)
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z) - Transformer-based Model for Word Level Language Identification in
Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer based model for word-level language identification in code-mixed Kannada English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - Offensive Language Identification in Low-resourced Code-mixed Dravidian
languages using Pseudo-labeling [0.16252563723817934]
We classify codemixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
arXiv Detail & Related papers (2021-08-27T08:43:08Z) - Evaluating Input Representation for Language Identification in
Hindi-English Code Mixed Text [4.4904382374090765]
Code-mixed text comprises text written in more than one language.
People naturally tend to combine local language with global languages like English.
In this work, we focus on language identification in code-mixed sentences for Hindi-English mixed text.
arXiv Detail & Related papers (2020-11-23T08:08:09Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)