Transformer-based Model for Word Level Language Identification in
Code-mixed Kannada-English Texts
- URL: http://arxiv.org/abs/2211.14459v1
- Date: Sat, 26 Nov 2022 02:39:19 GMT
- Title: Transformer-based Model for Word Level Language Identification in
Code-mixed Kannada-English Texts
- Authors: Atnafu Lambebo Tonja, Mesay Gemeda Yigezu, Olga Kolesnikova, Moein
Shahiki Tash, Grigori Sidorov, Alexander Gelbuk
- Abstract summary: We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
On the CoLI-Kenglish dataset, the proposed model achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
- Score: 55.41644538483948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code-mixed data currently receives a lot of attention in natural language
processing (NLP) research. Language identification of code-mixed social media
text has become an active research problem in recent years, driven by the
growth and influence of social media in communication. This paper is the
system description of the Instituto Politécnico Nacional, Centro de
Investigación en Computación (CIC) team for the CoLI-Kanglish shared task at
ICON2022. We propose the use of a Transformer-based model for word-level
language identification in code-mixed Kannada-English texts. On the
CoLI-Kenglish dataset, the proposed model achieves a weighted F1-score of 0.84
and a macro F1-score of 0.61.
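The abstract reports two differently averaged F1-scores. A gap between them often arises when some classes are rare: the weighted average is dominated by frequent classes, while the macro average counts every class equally. The minimal Python sketch below shows how the two averages are computed. The toy labels and predictions are invented for illustration; they are not taken from the CoLI-Kenglish dataset or from the authors' system, and the function names are hypothetical.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(gold) | set(pred))
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(gold, pred):
    """Mean of per-class F1 weighted by each class's share of gold labels."""
    scores = per_class_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[label] * n for label, n in support.items()) / len(gold)


# Invented toy data: 10 words, one rare class that the tagger misses.
gold = ["Kannada"] * 6 + ["English"] * 3 + ["Mixed-language"]
pred = ["Kannada"] * 6 + ["English"] * 3 + ["Kannada"]

print(f"macro F1:    {macro_f1(gold, pred):.3f}")
print(f"weighted F1: {weighted_f1(gold, pred):.3f}")
```

In this toy case the single missed "Mixed-language" word pulls the macro average down sharply, while the weighted average stays high because the two frequent classes are tagged well.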
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- BJTU-WeChat's Systems for the WMT22 Chat Translation Task [66.81525961469494]
This paper introduces the joint submission of the Beijing Jiaotong University and WeChat AI to the WMT'22 chat translation task for English-German.
Based on the Transformer, we apply several effective variants.
Our systems achieve 0.810 and 0.946 COMET scores.
arXiv Detail & Related papers (2022-11-28T02:35:04Z)
- CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts [0.0]
Many Indians, especially youths, are comfortable with Hindi and English in addition to their local languages. Hence, they often use more than one language to post their comments on social media.
Code-mixed Kn-En texts are extracted from YouTube video comments to construct the CoLI-Kenglish dataset and code-mixed Kn-En embeddings.
The words in the CoLI-Kenglish dataset are grouped into six major categories, namely "Kannada", "English", "Mixed-language", "Name", "Location" and "Other".
arXiv Detail & Related papers (2022-11-17T19:16:56Z)
- Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation System for the WMT22 Translation Task [49.916963624249355]
This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task.
We participate in the general translation task on English⇔Livonian.
Our system is based on M2M100 with novel techniques that adapt it to the target language pair.
arXiv Detail & Related papers (2022-10-17T04:34:09Z)
- Evaluating Input Representation for Language Identification in Hindi-English Code Mixed Text [4.4904382374090765]
Code-mixed text comprises text written in more than one language.
People naturally tend to combine local language with global languages like English.
In this work, we focus on language identification in code-mixed sentences for Hindi-English mixed text.
arXiv Detail & Related papers (2020-11-23T08:08:09Z)
- Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on Synthetically Generated Code-Mixed Data for Hate Speech Detection [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC 2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English).
The task aims to identify offensive language in a code-mixed dataset of comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2020-10-05T15:25:47Z)
- NLP-CIC at SemEval-2020 Task 9: Analysing sentiment in code-switching language using a simple deep-learning classifier [63.137661897716555]
Code-switching is a phenomenon in which two or more languages are used in the same message.
We use a standard convolutional neural network model to predict the sentiment of tweets in a blend of Spanish and English languages.
arXiv Detail & Related papers (2020-09-07T19:57:09Z)
- C1 at SemEval-2020 Task 9: SentiMix: Sentiment Analysis for Code-Mixed Social Media Text using Feature Engineering [0.9646922337783134]
This paper describes our feature engineering approach to sentiment analysis in code-mixed social media text for SemEval-2020 Task 9: SentiMix.
We obtain a weighted F1 score of 0.65 for the "Hinglish" task and 0.63 for the "Spanglish" task.
arXiv Detail & Related papers (2020-08-09T00:46:26Z)
- ULD@NUIG at SemEval-2020 Task 9: Generative Morphemes with an Attention Model for Sentiment Analysis in Code-Mixed Text [1.4926515182392508]
We present the Generative Morphemes with Attention (GenMA) Model sentiment analysis system contributed to SemEval 2020 Task 9 SentiMix.
The system aims to predict the sentiments of the given English-Hindi code-mixed tweets without using word-level language tags.
arXiv Detail & Related papers (2020-07-27T23:58:54Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)