Offensive Language Identification in Low-resourced Code-mixed Dravidian
languages using Pseudo-labeling
- URL: http://arxiv.org/abs/2108.12177v1
- Date: Fri, 27 Aug 2021 08:43:08 GMT
- Title: Offensive Language Identification in Low-resourced Code-mixed Dravidian
languages using Pseudo-labeling
- Authors: Adeep Hande, Karthik Puranik, Konthala Yasaswini, Ruba Priyadharshini,
Sajeetha Thavareesan, Anbukkarasi Sampath, Kogilavani Shanmugavadivel,
Durairaj Thenmozhi, Bharathi Raja Chakravarthi
- Abstract summary: We classify codemixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social media has effectively become the prime hub of communication and
digital marketing. As these platforms enable the free expression of thoughts
and facts in text, images, and video, there is a pressing need to screen them
to protect individuals and groups from offensive content targeted at them. Our
work intends to classify codemixed social media comments/posts in the Dravidian
languages of Tamil, Kannada, and Malayalam. We intend to improve offensive
language identification by generating pseudo-labels on the dataset. A custom
dataset is constructed by transliterating all the code-mixed texts into the
respective Dravidian language (Kannada, Malayalam, or Tamil), and then
generating pseudo-labels for the transliterated dataset. The two datasets are
combined using the generated pseudo-labels to create a custom dataset called
CMTRA. As Dravidian languages are under-resourced, our approach increases the
amount of training data for the language models. We fine-tune several recent
pretrained language models on the newly constructed dataset. We extract the
pretrained language embeddings and pass them to recurrent neural networks. We
observe that fine-tuning ULMFiT on the custom dataset yields the best results
on the code-mixed test sets of all three languages. Our approach yields the
best results among the benchmarked models on Tamil-English, achieving a
weighted F1-Score of 0.7934 while scoring competitive weighted F1-Scores of
0.9624 and 0.7306 on the code-mixed test sets of Malayalam-English and
Kannada-English, respectively.
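The pipeline described in the abstract, labeling transliterated copies with a model trained on the original code-mixed data and then merging the two sets into a larger corpus (CMTRA), can be sketched as follows. This is a minimal illustration only: the toy comments and the TF-IDF plus logistic-regression classifier are stand-ins for the paper's pretrained language models, and the labels are invented for the example.

```python
# Sketch of the pseudo-labeling step: a classifier trained on labeled
# code-mixed comments assigns pseudo-labels to (unlabeled) transliterated
# counterparts; both sets are then combined into one larger training corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Labeled code-mixed comments (toy examples; 0 = not offensive, 1 = offensive)
labeled_texts = ["super movie", "semma padam", "worst acting",
                 "mokka padam", "nice trailer", "waste movie"]
labels = [0, 0, 1, 1, 0, 1]

# Unlabeled transliterated counterparts (hypothetical)
translit_texts = ["semma trailer", "waste acting"]

vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression().fit(vec.fit_transform(labeled_texts), labels)

# Generate pseudo-labels for the transliterated texts
pseudo = clf.predict(vec.transform(translit_texts)).tolist()

# Combine originals and pseudo-labeled transliterations (the CMTRA idea)
combined_texts = labeled_texts + translit_texts
combined_labels = labels + pseudo
print(len(combined_texts), len(combined_labels))  # 8 8
```

In the paper itself the pseudo-labels come from fine-tuned pretrained language models rather than a bag-of-words classifier; the structure of the data flow is the same.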
Related papers
- Offensive Language Identification in Transliterated and Code-Mixed
Bangla [29.30985521838655]
In this paper, we explore offensive language identification in texts with transliterations and code-mixing.
We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments.
We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset.
arXiv Detail & Related papers (2023-11-25T13:27:22Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA)
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Transformer-based Model for Word Level Language Identification in
Code-mixed Kannada-English Texts [55.41644538483948]
We propose a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian
Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- IIITT@Dravidian-CodeMix-FIRE2021: Transliterate or translate? Sentiment
analysis of code-mixed text in Dravidian languages [0.0]
This research paper makes a modest contribution to this line of research in the form of sentiment analysis of code-mixed social media comments in the popular Dravidian languages Kannada, Tamil, and Malayalam.
It describes the work for the shared task conducted by Dravidian-CodeMix at FIRE 2021 by employing pre-trained models like ULMFiT and multilingual BERT fine-tuned on the code-mixed dataset.
The best models ranked 4th, 5th, and 10th in the Tamil, Kannada, and Malayalam tasks, respectively.
arXiv Detail & Related papers (2021-11-15T16:57:59Z)
- PSG@HASOC-Dravidian CodeMixFIRE2021: Pretrained Transformers for
Offensive Language Identification in Tanglish [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC2021: Hate Speech and Offensive Language Identification in Dravidian languages.
This task aims to identify offensive content in code-mixed comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2021-10-06T15:23:40Z)
- DravidianCodeMix: Sentiment Analysis and Offensive Language
Identification Dataset for Dravidian Languages in Code-Mixed Text [0.9738927161150494]
The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English.
The data was manually annotated by volunteer annotators and has a high inter-annotator agreement in Krippendorff's alpha.
arXiv Detail & Related papers (2021-06-17T13:13:26Z)
- indicnlp@kgp at DravidianLangTech-EACL2021: Offensive Language
Identification in Dravidian Languages [0.0]
The paper presents the submission of the team indicnlp@kgp to the EACL 2021 shared task "Offensive Language Identification in Dravidian languages"
The task aimed to classify different offensive content types in 3 code-mixed Dravidian language datasets.
We achieved weighted-average F1 scores of 0.97, 0.77, and 0.72 on the Malayalam-English, Tamil-English, and Kannada-English datasets, respectively.
arXiv Detail & Related papers (2021-02-14T13:24:01Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
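The weighted and macro F1 scores quoted throughout these summaries average per-class F1 differently: macro F1 treats every class equally, while weighted F1 weights each class by its support, which matters for the imbalanced offensive-language datasets above. A minimal illustration with a toy label set (not data from any of the papers):

```python
# Weighted vs. macro F1: macro averages per-class F1 scores equally,
# while weighted F1 weights each class's F1 by its support
# (the number of true instances of that class).
from sklearn.metrics import f1_score

# Toy imbalanced labels: class 0 dominates
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 1, 1]

macro = f1_score(y_true, y_pred, average="macro")        # (6/7 + 2/3) / 2
weighted = f1_score(y_true, y_pred, average="weighted")  # 0.8*6/7 + 0.2*2/3
print(round(macro, 3), round(weighted, 3))  # 0.762 0.819
```

Because the majority class is classified more accurately here, the weighted average exceeds the macro average; on heavily skewed data the two metrics can diverge substantially.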
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.