IIITDWD-ShankarB@ Dravidian-CodeMixi-HASOC2021: mBERT based model for
identification of offensive content in south Indian languages
- URL: http://arxiv.org/abs/2204.10195v1
- Date: Wed, 13 Apr 2022 06:24:57 GMT
- Authors: Shankar Biradar, Sunil Saumya
- Abstract summary: Task 1 involves identifying offensive content in Malayalam data; Task 2 includes Malayalam and Tamil Code Mixed Sentences.
Our team participated in Task 2.
In our suggested model, we experiment with multilingual BERT to extract features, and three different classifiers are used on extracted features.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, there has been a lot of focus on offensive content. The
amount of offensive content posted on social media is increasing at an
alarming rate, creating a greater need than ever to address the issue.
To that end, the organizers of "Dravidian-Code Mixed
HASOC-2020" have created two challenges. Task 1 involves identifying offensive
content in Malayalam data, whereas Task 2 includes Malayalam and Tamil Code
Mixed Sentences. Our team participated in Task 2. In our suggested model, we
experiment with multilingual BERT to extract features, and three different
classifiers are used on extracted features. Our model received a weighted F1
score of 0.70 for Malayalam data and was ranked fifth; we also received a
weighted F1 score of 0.573 for Tamil Code Mixed data and were ranked eleventh.
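The weighted F1 score reported above is conventionally the average of per-class F1 scores, each weighted by the number of true instances of that class. The following is a minimal sketch of that computation in plain Python. The function name `weighted_f1` and the labels `OFF` and `NOT` are illustrative assumptions, not the authors' code or the shared task's actual label set.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight by class frequency
    return score


# Hypothetical labels, for illustration only.
gold = ["OFF", "OFF", "NOT", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT", "OFF"]
print(round(weighted_f1(gold, pred), 3))
```

Because each class contributes in proportion to its frequency, a model can score well on this metric while performing poorly on a rare class, which is why some of the related papers below also report a macro F1 score.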
Related papers
- ThangDLU at #SMM4H 2024: Encoder-decoder models for classifying text data on social disorders in children and adolescents [49.00494558898933]
This paper describes our participation in Task 3 and Task 5 of the #SMM4H (Social Media Mining for Health) 2024 Workshop.
Task 3 is a multi-class classification task centered on tweets discussing the impact of outdoor environments on symptoms of social anxiety.
Task 5 involves a binary classification task focusing on tweets reporting medical disorders in children.
We applied transfer learning from pre-trained encoder-decoder models such as BART-base and T5-small to identify the labels of a set of given tweets.
arXiv Detail & Related papers (2024-04-30T17:06:20Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer based model for word-level language identification in code-mixed Kannada English texts.
On the CoLI-Kenglish dataset, the proposed model achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- Tencent's Multilingual Machine Translation System for WMT22 Large-Scale African Languages [47.06332023467713]
This paper describes Tencent's multilingual machine translation systems for the WMT22 shared task on Large-Scale Machine Translation Evaluation for African Languages.
We adopt data augmentation, distributionally robust optimization, and language family grouping, respectively, to develop our multilingual neural machine translation (MNMT) models.
arXiv Detail & Related papers (2022-10-18T07:22:29Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, an m-BERT based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- PSG@HASOC-Dravidian CodeMixFIRE2021: Pretrained Transformers for Offensive Language Identification in Tanglish [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC2021: Hate Speech and Offensive Language Identification in Dravidian languages.
This task aims to identify offensive content in code-mixed comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2021-10-06T15:23:40Z)
- KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for Detection of Hate Speech and Offensive Code-Mixed Social Media text [1.0499611180329804]
This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European languages.
Datasets for two Dravidian languages, viz. Malayalam and Tamil, of 4000 observations each, were shared by the HASOC organizers.
The best performing classification models developed for both languages are applied on test datasets.
arXiv Detail & Related papers (2021-02-19T11:08:02Z)
- indicnlp@kgp at DravidianLangTech-EACL2021: Offensive Language Identification in Dravidian Languages [0.0]
The paper presents the submission of the team indicnlp@kgp to the EACL 2021 shared task "Offensive Language Identification in Dravidian languages".
The task aimed to classify different offensive content types in 3 code-mixed Dravidian language datasets.
We achieved weighted-average F1 scores of 0.97, 0.77, and 0.72 on the Malayalam-English, Tamil-English, and Kannada-English datasets, respectively.
arXiv Detail & Related papers (2021-02-14T13:24:01Z)
- WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments [16.938836887702923]
This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European languages task 2020.
The HASOC 2020 organizers provided participants with datasets containing code-mixed social media posts in Dravidian languages (Malayalam-English and Tamil-English).
Our system achieved a weighted average F1 score of 0.89 on the test set and ranked 5th out of 12 participants.
arXiv Detail & Related papers (2020-11-01T16:52:08Z)
- Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on Synthetically Generated Code-Mixed Data for Hate Speech Detection [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC 2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English).
The task aims to identify offensive language in a code-mixed dataset of comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2020-10-05T15:25:47Z)
- Enhanced Universal Dependency Parsing with Second-Order Inference and Mixture of Training Data [48.8386313914471]
This paper presents the system used in our submission to the IWPT 2020 Shared Task.
For the low-resource Tamil corpus, we specially mixed the training data of Tamil with other languages and significantly improved the performance of Tamil.
arXiv Detail & Related papers (2020-06-02T06:42:22Z)