Cross-lingual hate speech detection based on multilingual
domain-specific word embeddings
- URL: http://arxiv.org/abs/2104.14728v1
- Date: Fri, 30 Apr 2021 02:24:50 GMT
- Title: Cross-lingual hate speech detection based on multilingual
domain-specific word embeddings
- Authors: Aymé Arango, Jorge Pérez and Barbara Poblete
- Abstract summary: We propose to address the problem of multilingual hate speech detection from the perspective of transfer learning.
Our goal is to determine if knowledge from one particular language can be used to classify other languages.
We show that the use of our simple yet specific multilingual hate representations improves classification results.
- Score: 4.769747792846004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic hate speech detection in online social networks is an important
open problem in Natural Language Processing (NLP). Hate speech is a
multidimensional issue, strongly dependent on language and cultural factors.
Despite its relevance, research on this topic has been almost exclusively
devoted to English. Most supervised learning resources, such as labeled
datasets and NLP tools, have been created for this same language. Considering
that a large portion of users worldwide speak in languages other than English,
there is an important need for creating efficient approaches for multilingual
hate speech detection. In this work we propose to address the problem of
multilingual hate speech detection from the perspective of transfer learning.
Our goal is to determine if knowledge from one particular language can be used
to classify another language, and to determine effective ways to achieve this. We
propose a hate-speech-specific data representation and evaluate its effectiveness
against general-purpose universal representations, most of which, unlike our
proposed model, have been trained on massive amounts of data. We focus on a
cross-lingual setting, in which one needs to classify hate speech in one
language without having access to any labeled data for that language. We show
that the use of our simple yet specific multilingual hate representations
improves classification results. We explain this with a qualitative analysis
showing that our specific representation is able to capture some common
patterns in how hate speech presents itself in different languages.
Our proposal constitutes, to the best of our knowledge, the first attempt at
constructing multilingual task-specific representations. Despite its
simplicity, our model outperformed the previous approaches for most of the
experimental setups. Our findings can orient future solutions toward the use of
domain-specific representations.
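The cross-lingual setup the abstract describes, training on labeled data in a source language and classifying a target language with no labels, relies on both languages sharing one embedding space. A minimal sketch of that idea, using toy hand-aligned vectors and a nearest-centroid classifier (all words, vectors and labels below are illustrative, not the paper's actual model or data):

```python
import numpy as np

# Toy aligned multilingual embeddings: words from both languages live in
# one shared vector space (illustrative values, not real embeddings).
EMB = {
    # English
    "hate": np.array([0.9, 0.1]), "ugly": np.array([0.8, 0.2]),
    "love": np.array([0.1, 0.9]), "nice": np.array([0.2, 0.8]),
    # Spanish (aligned so translations land near each other)
    "odio": np.array([0.88, 0.12]), "feo": np.array([0.82, 0.18]),
    "amor": np.array([0.12, 0.88]), "lindo": np.array([0.22, 0.78]),
}

def embed(sentence):
    """Represent a sentence as the mean of its word vectors."""
    vecs = [EMB[w] for w in sentence.split() if w in EMB]
    return np.mean(vecs, axis=0)

# Labeled data exists only for the source language (English); 1 = hateful.
train = [("hate ugly", 1), ("love nice", 0)]

# Per-class centroids in the shared space stand in for a trained classifier.
centroids = {}
for label in {y for _, y in train}:
    centroids[label] = np.mean(
        [embed(x) for x, y in train if y == label], axis=0)

def predict(sentence):
    """Assign the label whose centroid is closest in embedding space."""
    v = embed(sentence)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

# Zero-shot: classify Spanish text without any Spanish labels.
print(predict("odio feo"))    # → 1 (hateful)
print(predict("amor lindo"))  # → 0 (non-hateful)
```

The Spanish sentences are classified correctly only because the toy embeddings are aligned across languages; with unaligned monolingual vectors the same classifier would fail, which is why the shared (here hate-domain-specific) representation carries the transfer.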
Related papers
- LAHM: Large Annotated Dataset for Multi-Domain and Multilingual Hate Speech Identification [2.048680519934008]
We present a new multilingual hate speech analysis dataset for English, Hindi, Arabic, French, German and Spanish.
This paper is the first to address the problem of identifying various types of hate speech across five broad domains in these six languages.
arXiv Detail & Related papers (2023-04-03T12:03:45Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages [35.185808055004344]
Most hate speech datasets so far focus on English-language content.
More data is needed, but annotating hateful content is expensive, time-consuming and potentially harmful to annotators.
We explore data-efficient strategies for expanding hate speech detection into under-resourced languages.
arXiv Detail & Related papers (2022-10-20T15:49:00Z)
- Highly Generalizable Models for Multilingual Hate Speech Detection [0.0]
Hate speech detection has become an important research topic within the past decade.
We compile a dataset of 11 languages and reconcile their differing annotation schemes by analyzing the combined data under binary labels: hate speech or not hate speech.
We conduct three types of experiments for a binary hate speech classification task: Multilingual-Train Monolingual-Test, Monolingual-Train Monolingual-Test and Language-Family-Train Monolingual-Test scenarios.
arXiv Detail & Related papers (2022-01-27T03:09:38Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
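The label imbalance this entry describes, many more non-hate than hate examples, is commonly countered with inverse-frequency class weighting so the rare class contributes proportionally to the loss. A sketch of that standard heuristic (an illustrative remedy, not necessarily the one this paper adopts; the toy counts are made up):

```python
from collections import Counter

# Toy imbalanced dataset: 90% non-hate (0), 10% hate (1).
labels = [0] * 90 + [1] * 10

# "Balanced" weighting: weight(c) = n_samples / (n_classes * count(c)),
# so each class contributes equally in aggregate to a weighted loss.
counts = Counter(labels)
n, k = len(labels), len(counts)
weights = {c: n / (k * counts[c]) for c in counts}

print(weights)  # class 0 gets ~0.556, the rare class 1 gets 5.0
```

Passing such weights to a weighted cross-entropy (or resampling to the same effect) keeps the classifier from collapsing onto the majority non-hate class.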
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Cross-lingual Capsule Network for Hate Speech Detection in Social Media [6.531659195805749]
We investigate the cross-lingual hate speech detection task, tackling the problem by adapting the hate speech resources from one language to another.
We propose a cross-lingual capsule network learning model coupled with extra domain-specific lexical semantics for hate speech.
Our model achieves state-of-the-art performance on benchmark datasets from AMI@Evalita 2018 and AMI@Ibereval 2018.
arXiv Detail & Related papers (2021-08-06T12:53:41Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the cross-lingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)