Offense Detection in Dravidian Languages using Code-Mixing Index based
Focal Loss
- URL: http://arxiv.org/abs/2111.06916v1
- Date: Fri, 12 Nov 2021 19:50:24 GMT
- Title: Offense Detection in Dravidian Languages using Code-Mixing Index based
Focal Loss
- Authors: Debapriya Tula, Shreyas MS, Viswanatha Reddy, Pranjal Sahu, Sumanth
Doddapaneni, Prathyush Potluri, Rohan Sukumaran, Parth Patwa
- Abstract summary: Complexity of identifying offensive content is exacerbated by the usage of multiple modalities.
Our model can handle offensive language detection in a low-resource, class-imbalanced, multilingual, and code-mixed setting.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the past decade, we have seen exponential growth in online content
fueled by social media platforms. Data generation at this scale comes with the
caveat of a vast amount of offensive content within it. The complexity of
identifying offensive content is exacerbated by the usage of multiple
modalities (image, language, etc.), code-mixed language, and more. Moreover,
even if we carefully sample and annotate offensive content, there will always
exist significant class imbalance between offensive and non-offensive content.
In this paper, we introduce a novel Code-Mixing Index (CMI) based focal loss
which addresses two challenges: (1) code-mixing in languages and (2) the class
imbalance problem in Dravidian language offense detection. We also replace the
conventional dot-product-based classifier with a cosine-based classifier, which
results in a boost in performance. Further, we use multilingual models that
help transfer characteristics learnt across languages to work effectively with
low-resource languages. It is also important to note that our model handles
instances of mixed script (say, usage of both Latin and Dravidian Tamil
scripts) as well. Our model can handle offensive language detection in a
low-resource, class-imbalanced, multilingual, and code-mixed setting.
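The abstract's two key ingredients can be sketched in code. This is an illustrative reconstruction, not the paper's exact formulation: the Code-Mixing Index follows the standard definition of Das and Gambäck (2014), while the `1 + CMI/100` focal-loss weighting and the cosine-classifier scale of 16 are hypothetical choices standing in for whatever the authors actually used.

```python
import math

def code_mixing_index(lang_tags):
    """Code-Mixing Index (Das & Gambaeck, 2014):
    CMI = 100 * (1 - max_lang_count / (n - u)),
    where n is the total token count and u counts
    language-independent tokens (tagged "univ" here)."""
    n = len(lang_tags)
    counts = {}
    u = 0
    for tag in lang_tags:
        if tag == "univ":                 # language-independent token
            u += 1
        else:
            counts[tag] = counts.get(tag, 0) + 1
    if not counts:                        # monolingual-degenerate case
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / (n - u))

def cmi_focal_loss(p_true, cmi, gamma=2.0):
    """Focal loss -(1 - p)^gamma * log(p), up-weighted by the
    sentence's CMI so heavily code-mixed examples contribute more.
    The weighting 1 + cmi/100 is a hypothetical choice."""
    weight = 1.0 + cmi / 100.0
    return weight * -((1.0 - p_true) ** gamma) * math.log(p_true)

def cosine_logits(x, W, scale=16.0):
    """Cosine classifier: logits are scale * cos(theta) between the
    feature vector and each class weight row, replacing the dot product."""
    def norm(v):
        return math.sqrt(sum(c * c for c in v))
    xn = norm(x)
    return [scale * sum(a * b for a, b in zip(x, w)) / (xn * norm(w))
            for w in W]
```

For example, a Tamil-English comment tagged `["ta", "en", "ta", "univ"]` has CMI = 100 * (1 - 2/3) ≈ 33.3, so its focal loss is scaled up relative to a monolingual comment.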
Related papers
- Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages [0.0]
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels.
This paper introduces a prompt based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages.
In this work, we leveraged GPT-3.5 Turbo to examine whether large language models are able to classify words into the correct categories.
arXiv Detail & Related papers (2024-11-06T16:20:37Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study the output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Multilingual Text Classification for Dravidian Languages [4.264592074410622]
We propose a multilingual text classification framework for the Dravidian languages.
On the one hand, the framework used the LaBSE pre-trained model as the base model.
On the other hand, in view of the problem that the model cannot well recognize and utilize the correlation among languages, we further proposed a language-specific representation module.
arXiv Detail & Related papers (2021-12-03T04:26:49Z)
- Offensive Language Identification in Low-resourced Code-mixed Dravidian Languages using Pseudo-labeling [0.16252563723817934]
We classify codemixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
arXiv Detail & Related papers (2021-08-27T08:43:08Z)
- Hate-Alert@DravidianLangTech-EACL2021: Ensembling strategies for Transformer-based Offensive language Detection [5.139400587753555]
Social media often acts as breeding grounds for different forms of offensive content.
We present an exhaustive exploration of different transformer models. We also provide a genetic algorithm technique for ensembling different models.
Our ensembled models trained separately for each language secured the first position in Tamil, the second position in Kannada, and the first position in Malayalam sub-tasks.
arXiv Detail & Related papers (2021-02-19T18:35:38Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- CUSATNLP@HASOC-Dravidian-CodeMix-FIRE2020: Identifying Offensive Language from Manglish Tweets [0.0]
We present a working model submitted for Task 2 of the sub-track HASOC Offensive Language Identification - DravidianCodeMix.
It is a message level classification task.
An embedding model-based classifier identifies offensive and not offensive comments in our approach.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
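The KL-divergence self-teaching loss described for FILTER can be sketched generically. This is a minimal illustration, not FILTER's actual implementation: it assumes the auto-generated soft pseudo-labels arrive as a probability vector and computes KL(pseudo-labels || model predictions) over the target-language logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_self_teaching_loss(target_logits, soft_labels, eps=1e-12):
    """KL(soft_labels || softmax(target_logits)): a generic
    self-teaching term on auto-generated soft pseudo-labels for
    translated text in the target language."""
    p = softmax(target_logits)
    return sum(q * (math.log(max(q, eps)) - math.log(max(pi, eps)))
               for q, pi in zip(soft_labels, p))
```

The loss is zero when the model's prediction already matches the pseudo-labels and grows as the two distributions diverge, pulling the target-language predictions toward the soft labels during training.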
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.