Fine-tuning of Pre-trained Transformers for Hate, Offensive, and Profane
Content Detection in English and Marathi
- URL: http://arxiv.org/abs/2110.12687v1
- Date: Mon, 25 Oct 2021 07:11:02 GMT
- Authors: Anna Glazkova, Michael Kadantsev and Maksim Glazkov
- Abstract summary: This paper describes neural models developed for the Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages Shared Task 2021.
For the English subtasks, we investigate the impact of additional hate speech corpora used to fine-tune transformer models.
For the Marathi task, we propose a system based on the Language-Agnostic BERT Sentence Embedding (LaBSE).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes neural models developed for the Hate Speech and
Offensive Content Identification in English and Indo-Aryan Languages Shared
Task 2021. Our team, neuro-utmn-thales, participated in two tasks on binary and
fine-grained classification of English tweets that contain hate, offensive, and
profane content (English Subtasks A & B) and in one task on identification of
problematic content in Marathi (Marathi Subtask A). For the English subtasks,
we investigate the impact of additional hate speech corpora used to fine-tune
transformer models. We also apply a one-vs-rest approach based on
Twitter-RoBERTa to discriminate between hate, profane, and offensive posts. Our
models ranked third in English Subtask A with an F1-score of 81.99% and second
in English Subtask B with an F1-score of 65.77%.
For the Marathi task, we propose a system based on the Language-Agnostic BERT
Sentence Embedding (LaBSE). This model achieved the second-best result in
Marathi Subtask A, obtaining an F1-score of 88.08%.
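To make the one-vs-rest approach from the abstract concrete, here is a minimal sketch that scores a tweet with one binary Twitter-RoBERTa classifier per fine-grained label and keeps the most confident class. The checkpoint name, label set, and the omitted fine-tuning loop are assumptions for illustration, not the authors' exact configuration.

```python
# Hedged sketch: one-vs-rest fine-grained classification over a Twitter-RoBERTa encoder.
# Checkpoint name and label set are assumptions; each binary head would still need
# fine-tuning on binarized labels (training loop omitted here).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "cardiffnlp/twitter-roberta-base"  # assumed Twitter-RoBERTa checkpoint
LABELS = ["HATE", "OFFN", "PRFN"]               # hate / offensive / profane

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

# One binary (label vs. rest) classifier per fine-grained class.
binary_models = {
    label: AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
    for label in LABELS
}

def predict_fine_grained(tweet: str) -> str:
    """Return the label whose binary model is most confident the tweet is positive."""
    enc = tokenizer(tweet, return_tensors="pt", truncation=True, max_length=128)
    scores = {}
    with torch.no_grad():
        for label, model in binary_models.items():
            probs = torch.softmax(model(**enc).logits, dim=-1)
            scores[label] = probs[0, 1].item()  # probability of the positive class
    return max(scores, key=scores.get)
```

Similarly, a minimal sketch of how LaBSE sentence embeddings could feed a classifier for the Marathi subtask is given below; the frozen-embedding setup, the logistic-regression head, and the placeholder data are illustrative assumptions, since the paper's system may fine-tune LaBSE end-to-end.

```python
# Hedged sketch: LaBSE sentence embeddings plus a simple classifier for Marathi posts.
# Frozen embeddings and logistic regression are assumptions, not the paper's exact pipeline.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # language-agnostic encoder

# Placeholder training data: 1 = hate/offensive/profane, 0 = not.
train_texts = ["placeholder Marathi post 1", "placeholder Marathi post 2"]
train_labels = [1, 0]

X_train = encoder.encode(train_texts)            # 768-dimensional sentence embeddings
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

def classify(post: str) -> int:
    """Predict problematic (1) vs. non-problematic (0) for a single post."""
    return int(clf.predict(encoder.encode([post]))[0])
```

In practice, the classifier would be trained on the full Marathi training split rather than the placeholder examples shown here.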
Related papers
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the m-BERT-based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- RuArg-2022: Argument Mining Evaluation [69.87149207721035]
This paper is the organizers' report on the first competition of argumentation analysis systems dealing with Russian-language texts.
A corpus containing 9,550 sentences (comments on social media posts) on three topics related to the COVID-19 pandemic was prepared.
The system that won first place in both tasks used the NLI (Natural Language Inference) variant of the BERT architecture.
arXiv Detail & Related papers (2022-06-18T17:13:37Z)
- Multilingual Hate Speech and Offensive Content Detection using Modified Cross-entropy Loss [0.0]
Large language models are trained on large amounts of data and make use of contextual embeddings.
The data is also quite unbalanced, so we used a modified cross-entropy loss to tackle the issue (see the weighted-loss sketch after this list).
Our team (HNLP) achieved macro F1-scores of 0.808 and 0.639 in English Subtask A and English Subtask B, respectively.
arXiv Detail & Related papers (2022-02-05T20:31:40Z)
- Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages [4.267837363677351]
This paper presents the HASOC subtrack for English, Hindi, and Marathi.
The data set was assembled from Twitter.
The best classification algorithms for Task A achieve F1 measures of 0.91, 0.78, and 0.83 for Marathi, Hindi, and English, respectively.
arXiv Detail & Related papers (2021-12-17T03:28:54Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- One to rule them all: Towards Joint Indic Language Hate Speech Detection [7.296361860015606]
We present a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection.
On the provided testing corpora, we achieve Macro F1 scores of 0.7996, 0.7748, 0.8651 for sub-task 1A and 0.6268, 0.5603 during the fine-grained classification of sub-task 1B.
arXiv Detail & Related papers (2021-09-28T13:30:00Z)
- Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages [2.927129789938848]
The HASOC track intends to develop and optimize Hate Speech detection algorithms for Hindi, German and English.
The dataset is collected from a Twitter archive and pre-classified by a machine learning system.
Overall, 252 runs were submitted by 40 teams. The best classification algorithms for Task A achieve F1 measures of 0.51, 0.53, and 0.52 for English, Hindi, and German, respectively.
arXiv Detail & Related papers (2021-08-12T19:02:53Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
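For the "Modified Cross-entropy Loss" entry above, one common way to modify cross-entropy for unbalanced data is to weight it by inverse class frequency. The sketch below shows that variant with assumed per-class counts; it is not necessarily the exact modification used by the HNLP team.

```python
# Hedged sketch: class-weighted cross-entropy for unbalanced hate-speech labels.
# Per-class counts and the inverse-frequency weighting scheme are assumptions.
import torch
import torch.nn as nn

label_counts = torch.tensor([7000.0, 2000.0, 1000.0])   # assumed counts per class
class_weights = label_counts.sum() / (len(label_counts) * label_counts)

loss_fn = nn.CrossEntropyLoss(weight=class_weights)     # rare classes contribute more

logits = torch.randn(4, 3)                               # batch of 4 examples, 3 classes
targets = torch.tensor([0, 2, 1, 2])
loss = loss_fn(logits, targets)
print(round(loss.item(), 4))
```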