Deep Learning Models for Multilingual Hate Speech Detection
- URL: http://arxiv.org/abs/2004.06465v3
- Date: Wed, 9 Dec 2020 05:48:56 GMT
- Title: Deep Learning Models for Multilingual Hate Speech Detection
- Authors: Sai Saketh Aluru, Binny Mathew, Punyajoy Saha, and Animesh Mukherjee
- Abstract summary: In this paper, we conduct a large-scale analysis of multilingual hate speech in 9 languages from 16 different sources.
We observe that in the low-resource setting, simple models such as LASER embeddings with logistic regression perform best.
In the case of zero-shot classification, languages such as Italian and Portuguese achieve good results.
- Score: 5.977278650516324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hate speech detection is a challenging problem, with most of the
available datasets in only one language: English. In this paper, we conduct a
large-scale analysis of multilingual hate speech in 9 languages from 16
different sources. We observe that in the low-resource setting, simple models
such as LASER embeddings with logistic regression perform best, while in the
high-resource setting BERT-based models perform better. In the case of zero-shot classification,
languages such as Italian and Portuguese achieve good results. Our proposed
framework could be used as an efficient solution for low-resource languages.
These models could also act as good baselines for future multilingual hate
speech detection tasks. We have made our code and experimental settings public
for other researchers at https://github.com/punyajoy/DE-LIMIT.
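A minimal sketch of that low-resource recipe, assuming the community laserembeddings package and placeholder data (the authors' actual code is in the repository above):

```python
# Rough sketch of the low-resource recipe: LASER sentence embeddings fed
# into logistic regression. Assumes the third-party laserembeddings
# package (models downloaded via `python -m laserembeddings download-models`).
# Data and labels here are placeholders, not from the paper.
from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression

train_texts = ["an example sentence", "another example sentence"]
train_labels = [0, 1]  # 0 = non-hate, 1 = hate (placeholder labels)

laser = Laser()
# LASER maps sentences from its supported languages into one shared
# 1024-dimensional space, so a single classifier can serve many languages.
X_train = laser.embed_sentences(train_texts, lang="en")

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)

# Because the embedding space is shared, the same classifier can score
# text in another language, i.e. the zero-shot setting from the abstract.
X_test = laser.embed_sentences(["una frase de ejemplo"], lang="es")
print(clf.predict(X_test))
```

For the high-resource setting, the abstract's finding suggests replacing this classifier with a fine-tuned multilingual BERT-based model instead.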
Related papers
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a
Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z) - Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models with more data outperform monolingual ones but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z) - Data-Efficient Strategies for Expanding Hate Speech Detection into
Under-Resourced Languages [35.185808055004344]
Most hate speech datasets so far focus on English-language content.
More data is needed, but annotating hateful content is expensive, time-consuming and potentially harmful to annotators.
We explore data-efficient strategies for expanding hate speech detection into under-resourced languages.
arXiv Detail & Related papers (2022-10-20T15:49:00Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper, we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance in hate speech datasets, since the high ratio of non-hate to hate examples often leads to low model performance; a common mitigation, class reweighting, is sketched below.
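A hypothetical illustration of that class-reweighting idea, with synthetic data standing in for a real hate speech corpus:

```python
# Class reweighting for an imbalanced binary task; the 9:1 split and the
# random features are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = np.array([0] * 900 + [1] * 100)  # 9:1 non-hate vs. hate examples
X = rng.normal(size=(len(y), 16))    # stand-in feature vectors

# class_weight="balanced" scales each class by
# n_samples / (n_classes * class_count), so the rare hate class
# contributes proportionally more to the loss per example.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```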
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Cross-lingual hate speech detection based on multilingual
domain-specific word embeddings [4.769747792846004]
We propose to address the problem of multilingual hate speech detection from the perspective of transfer learning.
Our goal is to determine whether knowledge from one particular language can be used to classify other languages.
We show that the use of our simple yet specific multilingual hate representations improves classification results.
arXiv Detail & Related papers (2021-04-30T02:24:50Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages.
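As a hedged illustration of that distillation step, a short PyTorch sketch of generic multi-teacher knowledge distillation (the paper's exact loss may differ):

```python
# Student matches the averaged soft labels of several language-branch
# teachers. Shapes and the class count are placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, T=2.0):
    # Average the temperature-softened teacher distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # Standard knowledge distillation: KL divergence, rescaled by T^2.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (T * T)

# Toy usage: three language-branch teachers, one student, 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = [torch.randn(4, 10) for _ in range(3)]
distillation_loss(student_logits, teacher_logits).backward()
```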
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Improved acoustic word embeddings for zero-resource languages using
multilingual transfer [37.78342106714364]
We train a single supervised embedding model on labelled data from multiple well-resourced languages and apply it to unseen zero-resource languages.
We consider three multilingual recurrent neural network (RNN) models: a classifier trained on the joint vocabularies of all training languages; a Siamese RNN trained to discriminate between same and different words from multiple languages; and a correspondence autoencoder (CAE) RNN trained to reconstruct word pairs.
All of these models outperform state-of-the-art unsupervised models trained on the zero-resource languages themselves, giving relative improvements of more than 30% in average precision.
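A hypothetical PyTorch sketch of the Siamese variant, with made-up shapes and hyperparameters rather than the paper's configuration:

```python
# A shared GRU encodes two acoustic feature sequences, and a
# contrastive-style loss pulls same-word pairs together; the 0.5 margin
# and MFCC dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticEncoder(nn.Module):
    def __init__(self, n_feats=13, dim=128):
        super().__init__()
        self.gru = nn.GRU(n_feats, dim, batch_first=True)

    def forward(self, x):       # x: (batch, frames, n_feats), e.g. MFCCs
        _, h = self.gru(x)      # final hidden state as the word embedding
        return F.normalize(h[-1], dim=-1)

encoder = AcousticEncoder()     # shared weights for both inputs (Siamese)
a = encoder(torch.randn(8, 100, 13))
b = encoder(torch.randn(8, 100, 13))
same = torch.randint(0, 2, (8,)).float()  # 1 if the pair is the same word
dist = 1 - (a * b).sum(dim=-1)            # cosine distance in [0, 2]
# Pull same-word pairs together; push different pairs past the margin.
loss = (same * dist + (1 - same) * F.relu(0.5 - dist)).mean()
```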
arXiv Detail & Related papers (2020-06-02T12:28:34Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)