HASOCOne@FIRE-HASOC2020: Using BERT and Multilingual BERT models for
Hate Speech Detection
- URL: http://arxiv.org/abs/2101.09007v1
- Date: Fri, 22 Jan 2021 08:55:32 GMT
- Title: HASOCOne@FIRE-HASOC2020: Using BERT and Multilingual BERT models for
Hate Speech Detection
- Authors: Suman Dowlagar, Radhika Mamidi
- Abstract summary: We propose an approach to automatically classify hate speech and offensive content.
We have used the datasets obtained from FIRE 2019 and 2020 shared tasks.
We observed that the pre-trained BERT model and the multilingual-BERT model gave the best results.
- Score: 9.23545668304066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hateful and toxic content has become a significant concern in
today's world due to the exponential rise of social media. The increase in hate speech and
harmful content motivated researchers to dedicate substantial efforts to the
challenging direction of hateful content identification. In this task, we
propose an approach to automatically classify hate speech and offensive
content. We have used the datasets obtained from FIRE 2019 and 2020 shared
tasks. We perform experiments by taking advantage of transfer learning models.
We observed that the pre-trained BERT model and the multilingual-BERT model
gave the best results. The code is made publicly available at
https://github.com/suman101112/hasoc-fire-2020.
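As a rough illustration of the transfer-learning setup the abstract describes, here is a minimal sketch of fine-tuning a pre-trained (multilingual) BERT model for binary hate/offensive classification with the HuggingFace transformers library. The toy data, HOF/NOT label encoding, and hyperparameters are illustrative assumptions, not the authors' exact configuration (see their repository for that).
```python
# Sketch: fine-tune BERT / multilingual BERT for HOF (1) vs. NOT (0).
# Data, labels, and hyperparameters below are illustrative only.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # or "bert-base-cased" for English

class HasocDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                           num_labels=2)

# Toy rows; the real task uses the FIRE 2019/2020 HASOC tweets.
train = HasocDataset(["an offensive example", "a harmless example"],
                     [1, 0], tokenizer)
loader = DataLoader(train, batch_size=16, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy over the 2 classes
        loss.backward()
        optimizer.step()
```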
Related papers
- Understanding writing style in social media with a supervised
contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 × 10^6 authored texts.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
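As a loose illustration of the support-base protocol described above (embed an author's documents, average them into a profile, attribute a query text to the nearest profile), here is a sketch using a generic sentence encoder as a stand-in; the encoder name and cosine-similarity matching are assumptions, not the STAR model itself.
```python
# Sketch of support-base authorship matching: build an author profile by
# averaging embeddings of that author's documents, then attribute a query
# text to the most similar profile. Generic encoder, not the STAR model.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in style encoder

def author_profile(docs):
    # Average the embeddings of an author's support documents (e.g. 8 docs).
    emb = encoder.encode(docs, normalize_embeddings=True)
    v = emb.mean(axis=0)
    return v / np.linalg.norm(v)

profiles = {
    "author_a": author_profile(["text one by a", "text two by a"]),
    "author_b": author_profile(["text one by b", "text two by b"]),
}

query = encoder.encode(["disputed text"], normalize_embeddings=True)[0]
best = max(profiles, key=lambda a: float(profiles[a] @ query))
print("predicted author:", best)
```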
arXiv Detail & Related papers (2023-10-17T09:01:17Z)
- Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment [26.504056750529124]
We present GOTHate, a large-scale code-mixed crowdsourced dataset of around 51k posts for hate speech detection from Twitter.
We benchmark it with 10 recent baselines and investigate how adding endogenous signals enhances the hate speech detection task.
Our solution HEN-mBERT is a modular, multilingual, mixture-of-experts model that enriches the linguistic subspace with latent endogenous signals.
arXiv Detail & Related papers (2023-06-01T19:36:52Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare.
We propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Spread Love Not Hate: Undermining the Importance of Hateful Pre-training for Hate Speech Detection [0.7874708385247353]
We study the effects of hateful pre-training on low-resource hate speech classification tasks.
We evaluate different variations of tweet-based BERT models pre-trained on hateful, non-hateful, and mixed subsets of a 40M-tweet dataset.
We show that pre-training on non-hateful text from the target domain provides similar or better results.
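As an illustration of the continued (domain-adaptive) pre-training this paper compares, here is a minimal sketch of masked-language-model training on in-domain tweets with HuggingFace transformers; the corpus and hyperparameters are illustrative, not the paper's setup.
```python
# Sketch: continued masked-LM pre-training on in-domain text before
# fine-tuning. Corpus content and hyperparameters are illustrative.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

corpus = Dataset.from_dict({"text": ["in-domain tweet one",
                                     "in-domain tweet two"]})
tokenized = corpus.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens per batch, the standard BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-ckpt", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```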
arXiv Detail & Related papers (2022-10-09T13:53:06Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
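One standard remedy for the label imbalance discussed here is to weight the loss inversely to class frequency, so rare hate examples contribute more to the gradient. A minimal sketch with illustrative counts (not the paper's exact method):
```python
# Sketch: inverse-frequency class weighting for an imbalanced
# hate-speech dataset. The counts below are illustrative.
import torch
import torch.nn as nn

n_non_hate, n_hate = 9000, 1000          # e.g. a 9:1 imbalanced dataset
counts = torch.tensor([n_non_hate, n_hate], dtype=torch.float)
weights = counts.sum() / (2 * counts)    # inverse-frequency weights

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 2)               # model outputs for a batch of 4
labels = torch.tensor([0, 0, 1, 0])      # 0 = non-hate, 1 = hate
loss = criterion(logits, labels)         # hate-class errors weigh ~9x more
```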
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Probabilistic Impact Score Generation using Ktrain-BERT to Identify Hate Words from Twitter Discussions [0.5735035463793008]
This paper presents experimentation with a Keras-wrapped lightweight BERT model to successfully identify hate speech.
The dataset used for this task is the Hate Speech and Offensive Content Detection (HASOC 2021) data from FIRE 2021 in English.
Our system obtained a validation accuracy of 82.60%, with a maximum F1-Score of 82.68%.
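A minimal sketch of the ktrain-wrapped BERT workflow this paper describes, with toy arrays standing in for the HASOC 2021 English data; the class names, validation split, and hyperparameters are illustrative, not the authors' exact settings.
```python
# Sketch: the standard ktrain BERT text-classification workflow.
# Toy data stands in for the HASOC 2021 English tweets.
import ktrain
from ktrain import text

x = ["an offensive example", "a harmless example",
     "another benign post", "more toxic content"]
y = ["HOF", "NOT", "NOT", "HOF"]

(x_trn, y_trn), (x_val, y_val), preproc = text.texts_from_array(
    x_train=x, y_train=y, class_names=["NOT", "HOF"],
    preprocess_mode="bert", maxlen=128, val_pct=0.25)

model = text.text_classifier("bert", train_data=(x_trn, y_trn),
                             preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_trn, y_trn),
                             val_data=(x_val, y_val), batch_size=6)
learner.fit_onecycle(2e-5, 1)  # one-cycle LR schedule, 1 epoch for the sketch
```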
arXiv Detail & Related papers (2021-11-25T06:35:49Z)
- Detection of Hate Speech using BERT and Hate Speech Word Embedding with Deep Model [0.5801044612920815]
This paper investigates the feasibility of leveraging domain-specific word embeddings in a Bidirectional LSTM-based deep model to automatically detect/classify hate speech.
The experiments showed that domain-specific word embeddings with the Bidirectional LSTM-based deep model achieved a 93% F1-score, while BERT achieved up to a 96% F1-score on a combined balanced dataset drawn from available hate speech datasets.
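A minimal sketch of a Bidirectional LSTM classifier over pre-trained word embeddings, the architecture this paper evaluates; here the embedding matrix is random, where the paper would load vectors trained on hate-speech corpora.
```python
# Sketch: BiLSTM over a frozen (domain-specific) embedding matrix.
# The random matrix below is a stand-in for real pre-trained vectors.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 20000, 300, 64
embedding_matrix = np.random.normal(
    size=(vocab_size, embed_dim)).astype("float32")

model = models.Sequential([
    tf.keras.Input(shape=(max_len,)),
    layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),                   # keep domain embeddings frozen
    layers.Bidirectional(layers.LSTM(128)),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # hate vs. not-hate
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```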
arXiv Detail & Related papers (2021-11-02T11:42:54Z)
- fBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available, with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
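A minimal sketch of the kind of threshold-based instance selection mentioned above: keep only SOLID-style instances whose aggregated offensiveness confidence exceeds a cutoff. The field names and cutoff value are illustrative assumptions, not the paper's exact scheme.
```python
# Sketch: select retraining instances whose semi-supervised offensive
# confidence clears a threshold. Fields and cutoff are illustrative.
THRESHOLD = 0.5

instances = [
    {"text": "tweet a", "avg_conf": 0.81},
    {"text": "tweet b", "avg_conf": 0.42},
    {"text": "tweet c", "avg_conf": 0.67},
]

selected = [x["text"] for x in instances if x["avg_conf"] >= THRESHOLD]
print(f"kept {len(selected)} of {len(instances)} instances")
```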
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Offensive Language and Hate Speech Detection with Deep Learning and Transfer Learning [1.77356577919977]
We propose an approach to automatically classify tweets into three classes: Hate, Offensive, and Neither.
We create a class module that contains the main functionality, including text classification, sentiment checking, and text data augmentation.
arXiv Detail & Related papers (2021-08-06T20:59:47Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves a 91.51% F1 score on English Sub-task A, which is comparable to the first-place result.
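A minimal sketch of BERT-based multi-task learning in this spirit: a shared encoder with one classification head per sub-task. The head definitions and sub-task labels are illustrative assumptions, not the system's exact architecture.
```python
# Sketch: shared BERT encoder with one linear head per sub-task.
# Head sizes and sub-task semantics below are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskBert(nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.head_a = nn.Linear(hidden, 2)  # e.g. offensive vs. not
        self.head_b = nn.Linear(hidden, 2)  # e.g. targeted vs. untargeted

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] vector
        return self.head_a(cls), self.head_b(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiTaskBert()
batch = tokenizer(["you are awful"], return_tensors="pt")
logits_a, logits_b = model(**batch)
# Training would sum the per-task cross-entropy losses, so both tasks
# backpropagate through (and regularize) the shared encoder.
```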
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.