Multilingual Hate Speech and Offensive Content Detection using Modified
Cross-entropy Loss
- URL: http://arxiv.org/abs/2202.02635v1
- Date: Sat, 5 Feb 2022 20:31:40 GMT
- Title: Multilingual Hate Speech and Offensive Content Detection using Modified
Cross-entropy Loss
- Authors: Arka Mitra, Priyanshu Sankhala
- Abstract summary: Large language models are trained on large amounts of data and make use of contextual embeddings.
The data is also quite imbalanced, so we used a modified cross-entropy loss to tackle the issue.
Our team (HNLP) achieved macro F1-scores of 0.808 and 0.639 in English Subtask A and English Subtask B, respectively.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing number of social media users has led to many people
misusing these platforms to spread offensive content and hate speech. Manually
tracking the vast number of posts is impractical, so it is necessary to devise
automated methods to identify them quickly. Large language models are trained
on large amounts of data and make use of contextual embeddings; we fine-tune
these models for our task. The data is also quite imbalanced, so we used a
modified cross-entropy loss to tackle the issue. We observed that a model
fine-tuned on Hindi corpora performs better. Our team (HNLP) achieved macro
F1-scores of 0.808 and 0.639 in English Subtask A and English Subtask B,
respectively. For Hindi Subtask A and Hindi Subtask B, our team achieved macro
F1-scores of 0.737 and 0.443, respectively, in HASOC 2021.
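The abstract does not spell out the exact modification, but a common way to counter class imbalance with cross-entropy is to weight each class by its inverse frequency. A minimal PyTorch sketch under that assumption (the class counts, logits, and labels below are placeholders, not the paper's data):

```python
import torch
import torch.nn as nn

# Hypothetical class counts for an imbalanced binary task
# (far more non-hateful posts than hateful ones).
class_counts = torch.tensor([8000.0, 2000.0])

# Inverse-frequency weights: the rare class gets the larger weight.
weights = class_counts.sum() / (len(class_counts) * class_counts)

# Weighted cross-entropy; in practice the logits would come from
# the fine-tuned language model's classification head.
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 2)           # stand-in for model outputs
labels = torch.tensor([0, 1, 1, 0])  # stand-in for gold labels
loss = criterion(logits, labels)
```

Focal loss is another common choice built on the same idea of down-weighting the easy majority class.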
Related papers
- HateGPT: Unleashing GPT-3.5 Turbo to Combat Hate Speech on X [0.0]
We evaluate the performance of a classification model using Macro-F1 scores across three distinct runs.
The results suggest that the model consistently performs well in terms of precision and recall, with run 1 showing the highest performance.
arXiv Detail & Related papers (2024-11-14T06:20:21Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC)
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
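The summary does not say what form the adversarial training takes; as a generic illustration only, and not necessarily the authors' method, FGM-style perturbation of the embedding layer is one common adversarial-training recipe in NLP:

```python
import torch

def fgm_adversarial_step(model, embedding_weight, loss_fn, batch, epsilon=1e-2):
    # Generic FGM-style adversarial step (an illustration only):
    # perturb the embedding matrix along the loss gradient, accumulate
    # the adversarial loss, then restore the original weights.
    loss = loss_fn(model, batch)
    loss.backward()
    grad = embedding_weight.grad
    norm = grad.norm()
    if norm > 0:
        delta = epsilon * grad / norm
        embedding_weight.data.add_(delta)     # move to adversarial point
        loss_fn(model, batch).backward()      # accumulate adversarial grads
        embedding_weight.data.sub_(delta)     # restore original embeddings
```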
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
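As a hedged illustration of the zero-shot usage this finding enables, here is a minimal sketch using the Hugging Face transformers library and the public bigscience/mt0-small checkpoint; the prompt wording is our own assumption:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/mt0-small")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-small")

# An English instruction applied to non-English input, relying on the
# cross-lingual task generalization described above.
prompt = "Is the following sentence offensive? Answer yes or no: <text>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```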
arXiv Detail & Related papers (2022-11-03T13:19:32Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the m-BERT based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
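The label-imbalance problem noted here is the same one the main paper tackles with its modified loss. A quick way to derive balancing weights from label frequencies, using scikit-learn (the 9:1 split below is a made-up illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 9:1 non-hate/hate split.
labels = np.array([0] * 900 + [1] * 100)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the rare hate class receives the larger weight.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=labels)
print(weights)  # approx. [0.56, 5.0]
```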
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages [4.267837363677351]
This paper presents the HASOC subtrack for English, Hindi, and Marathi.
The data set was assembled from Twitter.
The best classification algorithms for task A achieve F1 measures of 0.91, 0.78 and 0.83 for Marathi, Hindi and English, respectively.
arXiv Detail & Related papers (2021-12-17T03:28:54Z)
- Fine-tuning of Pre-trained Transformers for Hate, Offensive, and Profane Content Detection in English and Marathi [0.0]
This paper describes neural models developed for the Hate Speech and Offensive Content Identification in English and Indo-Aryan languages.
For English subtasks, we investigate the impact of additional corpora for hate speech detection to fine-tune transformer models.
For the Marathi tasks, we propose a system based on the Language-Agnostic BERT Sentence Embedding (LaBSE)
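The paper's full system is not described here; as a minimal sketch, LaBSE can be used through the sentence-transformers library to turn tweets into language-agnostic features for a downstream classifier (the texts and labels below are placeholders):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# LaBSE yields language-agnostic sentence embeddings, so one
# classifier can serve several languages.
encoder = SentenceTransformer("sentence-transformers/LaBSE")

train_texts = ["placeholder tweet one", "placeholder tweet two"]
train_labels = [1, 0]  # e.g., offensive vs. not offensive

X = encoder.encode(train_texts)
clf = LogisticRegression().fit(X, train_labels)
```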
arXiv Detail & Related papers (2021-10-25T07:11:02Z)
- One to rule them all: Towards Joint Indic Language Hate Speech Detection [7.296361860015606]
We present a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection.
On the provided testing corpora, we achieve Macro F1 scores of 0.7996, 0.7748, 0.8651 for sub-task 1A and 0.6268, 0.5603 during the fine-grained classification of sub-task 1B.
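Macro F1, the metric reported throughout these shared tasks, averages per-class F1 scores without weighting by class frequency, so minority classes count as much as majority ones. A quick scikit-learn illustration with made-up labels:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]  # placeholder fine-grained labels
y_pred = [0, 2, 2, 2, 1, 0]

# average="macro": compute F1 per class, then take the unweighted mean.
print(f1_score(y_true, y_pred, average="macro"))
```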
arXiv Detail & Related papers (2021-09-28T13:30:00Z)
- Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages [2.927129789938848]
The HASOC track intends to develop and optimize Hate Speech detection algorithms for Hindi, German and English.
The dataset is collected from a Twitter archive and pre-classified by a machine learning system.
Overall, 252 runs were submitted by 40 teams. The best classification algorithms for task A achieve F1 measures of 0.51, 0.53 and 0.52 for English, Hindi, and German, respectively.
arXiv Detail & Related papers (2021-08-12T19:02:53Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading (LBMRC)
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
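The paper's exact objective is not given here; a generic soft-label distillation loss in PyTorch conveys the idea of amalgamating teacher knowledge into one student (for MRC the logits would range over answer-span positions rather than the placeholder classes used below):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then penalize the KL divergence of the
    # student from the teacher; the T^2 factor keeps gradients scaled.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return kl * temperature ** 2

# Placeholder logits: batch of 4 examples, 3 output positions.
loss = distillation_loss(torch.randn(4, 3), torch.randn(4, 3))
```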
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place.
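A minimal sketch of the shared-encoder multi-task layout the summary describes; the checkpoint and head sizes are our assumptions, not the authors' exact architecture:

```python
import torch.nn as nn
from transformers import AutoModel

class MultiTaskOffenseModel(nn.Module):
    # Shared BERT encoder with one classification head per subtask
    # (a sketch; not the authors' exact configuration).
    def __init__(self, name="bert-base-uncased", n_a=2, n_b=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.head_a = nn.Linear(hidden, n_a)  # e.g., offensive vs. not
        self.head_b = nn.Linear(hidden, n_b)  # e.g., offense categorization

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] token representation as the shared feature.
        h = self.encoder(input_ids,
                         attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head_a(h), self.head_b(h)
```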
arXiv Detail & Related papers (2020-04-28T11:27:24Z)