Overview of the HASOC Subtrack at FIRE 2022: Offensive Language
Identification in Marathi
- URL: http://arxiv.org/abs/2211.10163v1
- Date: Fri, 18 Nov 2022 11:17:15 GMT
- Title: Overview of the HASOC Subtrack at FIRE 2022: Offensive Language
Identification in Marathi
- Authors: Tharindu Ranasinghe, Kai North, Damith Premasiri, Marcos Zampieri
- Abstract summary: The HASOC (Hate Speech and Offensive Content Identification) shared task is one of several international competitions organized to evaluate offensive content identification systems.
In its fourth iteration, HASOC 2022 included three subtracks for English, Hindi, and Marathi.
We report the results of the HASOC 2022 Marathi subtrack which provided participants with a dataset containing data from Twitter manually annotated using the popular OLID taxonomy.
The best performing algorithms were a mixture of traditional and deep learning approaches.
- Score: 15.466844451996051
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread presence of offensive content online has become a reason for great
concern in recent years, motivating researchers to develop robust systems
capable of identifying such content automatically. With the goal of carrying
out a fair evaluation of these systems, several international competitions have
been organized, providing the community with important benchmark data and
evaluation methods for various languages. Organized since 2019, the HASOC (Hate
Speech and Offensive Content Identification) shared task is one of these
initiatives. In its fourth iteration, HASOC 2022 included three subtracks for
English, Hindi, and Marathi. In this paper, we report the results of the HASOC
2022 Marathi subtrack which provided participants with a dataset containing
data from Twitter manually annotated using the popular OLID taxonomy. The
Marathi track featured three additional subtracks, each corresponding to one
level of the taxonomy: Task A - offensive content identification (offensive vs.
non-offensive); Task B - categorization of offensive types (targeted vs.
untargeted), and Task C - offensive target identification (individual vs. group
vs. others). Overall, 59 runs were submitted by 10 teams. The best systems
obtained an F1 of 0.9745 for Subtrack 3A, an F1 of 0.9207 for Subtrack 3B, and
an F1 of 0.9607 for Subtrack 3C. The best performing algorithms were a mixture of
traditional and deep learning approaches.
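The three-level taxonomy described in the abstract can be sketched as a small consistency check. This is a minimal illustration, not the organizers' code: the label strings (`OFF`/`NOT`, `TIN`/`UNT`, `IND`/`GRP`/`OTH`) are those of the original English OLID dataset and are assumed here; the Marathi data may name them differently. The function `is_valid_annotation` is a hypothetical helper.

```python
from typing import Optional

# Label strings are assumed from the original OLID dataset.
LEVEL_A = {"OFF", "NOT"}         # Task A: offensive vs. non-offensive
LEVEL_B = {"TIN", "UNT"}         # Task B: targeted vs. untargeted
LEVEL_C = {"IND", "GRP", "OTH"}  # Task C: individual vs. group vs. others


def is_valid_annotation(
    a: str, b: Optional[str] = None, c: Optional[str] = None
) -> bool:
    """Return True if the labels respect the three-level hierarchy."""
    if a not in LEVEL_A:
        return False
    if a == "NOT":
        # Non-offensive posts carry no Task B or Task C label.
        return b is None and c is None
    if b not in LEVEL_B:
        # Offensive posts need a Task B label.
        return False
    if b == "UNT":
        # Untargeted offensive posts carry no Task C label.
        return c is None
    # Targeted offensive posts need a Task C label.
    return c in LEVEL_C
```

Under these assumptions, `is_valid_annotation("OFF", "TIN", "GRP")` is accepted, while `is_valid_annotation("NOT", "TIN")` is rejected because a non-offensive post cannot have a target type.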
Related papers
- Tracking Every Thing in the Wild [61.917043381836656]
We introduce a new metric, Track Every Thing Accuracy (TETA), breaking tracking measurement into three sub-factors: localization, association, and classification.
Our experiments show that TETA evaluates trackers more comprehensively, and TETer achieves significant improvements on the challenging large-scale datasets BDD100K and TAO.
arXiv Detail & Related papers (2022-07-26T15:37:19Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE
2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the mBERT-based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- UrduFake@FIRE2021: Shared Track on Fake News Identification in Urdu [55.41644538483948]
This study reports on the second shared task, UrduFake@FIRE2021, on fake news detection in the Urdu language.
The proposed systems were based on various count-based features and used different classifiers as well as neural network architectures.
The stochastic gradient descent (SGD) algorithm outperformed other classifiers and achieved an F-score of 0.679.
arXiv Detail & Related papers (2022-07-11T19:15:04Z) - Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2021 [55.41644538483948]
The goal of the shared task is to motivate the community to come up with efficient methods for solving this vital problem.
The training set contains 1300 annotated news articles -- 750 real news, 550 fake news, while the testing set contains 300 news articles -- 200 real, 100 fake news.
The best performing system obtained an F1-macro score of 0.679, which is lower than the past year's best result of 0.907 F1-macro.
arXiv Detail & Related papers (2022-07-11T18:58:36Z)
- Exploiting Semantic Role Contextualized Video Features for
Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance
Retrieval Challenge 2022 [72.12974259966592]
We present our approach for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022.
We first parse sentences into semantic roles corresponding to verbs and nouns, then utilize self-attentions to exploit semantic role contextualized video features.
arXiv Detail & Related papers (2022-06-29T03:24:43Z)
- IIITDWD-ShankarB@ Dravidian-CodeMixi-HASOC2021: mBERT based model for
identification of offensive content in south Indian languages [0.0]
Task 1 involves identifying offensive content in Malayalam data; Task 2 includes Malayalam and Tamil Code Mixed Sentences.
Our team participated in Task 2.
In our suggested model, we experiment with multilingual BERT to extract features, and three different classifiers are used on extracted features.
arXiv Detail & Related papers (2022-04-13T06:24:57Z)
- Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive
Content Identification in English and Indo-Aryan Languages [4.267837363677351]
This paper presents the HASOC subtrack for English, Hindi, and Marathi.
The data set was assembled from Twitter.
The best classification algorithms for Task A achieved F1 scores of 0.91, 0.78, and 0.83 for Marathi, Hindi, and English, respectively.
arXiv Detail & Related papers (2021-12-17T03:28:54Z)
- Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive
Content Identification in Indo-European Languages [2.927129789938848]
The HASOC track intends to develop and optimize Hate Speech detection algorithms for Hindi, German and English.
The dataset is collected from a Twitter archive and pre-classified by a machine learning system.
Overall, 252 runs were submitted by 40 teams. The best classification algorithms for Task A achieved F1 scores of 0.51, 0.53, and 0.52 for English, Hindi, and German, respectively.
arXiv Detail & Related papers (2021-08-12T19:02:53Z)
- KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for
Detection of Hate Speech and Offensive Code-Mixed Social Media text [1.0499611180329804]
This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European languages.
Datasets for two Dravidian languages, Malayalam and Tamil, each containing 4000 observations, were shared by the HASOC organizers.
The best performing classification models developed for both languages are applied on test datasets.
arXiv Detail & Related papers (2021-02-19T11:08:02Z)
- Garain at SemEval-2020 Task 12: Sequence based Deep Learning for
Categorizing Offensive Language in Social Media [3.236217153362305]
SemEval-2020 Task 12 was OffenseEval: Multilingual Offensive Language Identification in Social Media.
Trained on 25% of the whole dataset, my system achieved a macro-averaged F1 score of 47.763%.
arXiv Detail & Related papers (2020-09-02T17:09:29Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage
Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.