Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces
- URL: http://arxiv.org/abs/2106.07505v1
- Date: Mon, 14 Jun 2021 15:34:37 GMT
- Title: Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces
- Authors: Vanessa Hahn, Dana Ruiter, Thomas Kleinbauer, Dietrich Klakow
- Abstract summary: Hate speech and profanity detection suffer from data sparsity, especially for languages other than English.
We identify profane subspaces in word and sentence representations and explore their generalization capability.
We observe that, on both similar and distant target tasks and across all languages, the subspace-based representations transfer more effectively than standard BERT representations.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hate speech and profanity detection suffer from data sparsity, especially for
languages other than English, due to the subjective nature of the tasks and the
resulting annotation incompatibility of existing corpora. In this study, we
identify profane subspaces in word and sentence representations and explore
their generalization capability on a variety of similar and distant target
tasks in a zero-shot setting. This is done monolingually (German) and
cross-lingually to closely related (English), distantly related (French), and
unrelated (Arabic) tasks. We observe that, on both similar and distant target
tasks and across all languages, the subspace-based representations transfer
more effectively than standard BERT representations in the zero-shot setting,
with improvements between F1 +10.9 and F1 +42.9 over the baselines across all
tested monolingual and cross-lingual scenarios.
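The abstract does not spell out how the profane subspace is constructed. As a hedged sketch of one common way to isolate such a subspace, the code below runs an SVD over difference vectors between paired profane and neutral embeddings; the pairing, the function names, and the choice of SVD are illustrative assumptions, and the paper's actual method may differ.

```python
# Minimal sketch: identify a "profane" subspace from paired embeddings and
# extract subspace coordinates as features. Hypothetical data and pairing;
# the paper's actual construction may differ.
import numpy as np

def profane_subspace(profane_emb: np.ndarray, neutral_emb: np.ndarray, k: int = 8) -> np.ndarray:
    """Return an orthonormal basis (k, d) spanning the profanity direction(s)."""
    diffs = profane_emb - neutral_emb                 # paired difference vectors, (n, d)
    diffs = diffs - diffs.mean(axis=0, keepdims=True) # center before SVD
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                                     # top-k right singular vectors

def subspace_features(x: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Coordinates of embeddings x (n, d) in the subspace, shape (n, k)."""
    return x @ basis.T
```

A zero-shot classifier for a new language or task could then be trained on these low-dimensional coordinates rather than on the full BERT vectors, which is the intuition behind the reported transfer gains.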
Related papers
- 1-800-SHARED-TASKS @ NLU of Devanagari Script Languages: Detection of Language, Hate Speech, and Targets using LLMs
This paper presents a detailed system description of our entry for the CHiPSAL 2025 shared task.
We focus on language detection, hate speech identification, and target detection in Devanagari script languages.
arXiv Detail & Related papers (2024-11-11T10:34:36Z)
- Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations
Large pretrained multilingual language models (ML-LMs) have shown remarkable capabilities of zero-shot cross-lingual transfer.
We present a novel view of projecting away language-specific factors from a multilingual embedding space.
We show that applying our method consistently leads to improvements over commonly used ML-LMs.
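As a hedged sketch of the general operation this summary describes, the snippet below projects a low-rank language subspace out of multilingual embeddings, assuming per-language mean directions approximate that subspace; the paper's actual factorization may differ.

```python
# Minimal sketch: remove a low-rank "language" subspace from multilingual
# sentence embeddings. Assumes per-language means span the language-specific
# directions; illustrative only.
import numpy as np

def remove_language_subspace(embs: np.ndarray, langs: np.ndarray, rank: int = 4) -> np.ndarray:
    """embs: (n, d) embeddings; langs: (n,) language labels per row."""
    centered = embs - embs.mean(axis=0, keepdims=True)
    means = np.stack([centered[langs == l].mean(axis=0) for l in np.unique(langs)])
    _, _, vt = np.linalg.svd(means, full_matrices=False)
    basis = vt[:rank]                              # (rank, d) language directions
    return centered - centered @ basis.T @ basis   # project language signal away
```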
arXiv Detail & Related papers (2024-01-11T09:54:11Z)
- Unlikelihood Tuning on Negative Samples Amazingly Improves Zero-Shot Translation
Zero-shot translation (ZST) aims to translate between language pairs unseen in the training data.
The common practice to guide the zero-shot language mapping during inference is to deliberately insert the source and target language IDs.
Recent studies have shown that language IDs sometimes fail to steer the ZST task, leaving models prone to the off-target problem.
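Unlikelihood tuning penalizes probability mass the model assigns to undesirable tokens. Below is a minimal sketch of the generic unlikelihood term, assuming `neg_ids` marks off-target (wrong-language) tokens; how the paper actually selects negative samples is not shown here.

```python
# Generic unlikelihood term: push down the probability of negative tokens.
import torch

def unlikelihood_loss(logits: torch.Tensor, neg_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab); neg_ids: (batch, seq) off-target token ids."""
    probs = logits.softmax(dim=-1)
    p_neg = probs.gather(-1, neg_ids.unsqueeze(-1)).squeeze(-1)  # prob of bad tokens
    return -torch.log1p(-p_neg.clamp(max=1 - 1e-6)).mean()       # -mean log(1 - p)
```

In practice this term is added to the usual likelihood objective with a weighting coefficient.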
arXiv Detail & Related papers (2023-09-28T17:02:36Z)
- Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models
We show how hate speech detection models benefit from a cross-lingual knowledge proxy brought by auxiliary-task fine-tuning.
We propose to train on multilingual auxiliary tasks to improve zero-shot transfer of hate speech detection models across languages.
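The mechanics behind such a proxy are plain multi-task fine-tuning. A hedged sketch follows, assuming a Hugging Face-style encoder output and illustrative task names; the paper's specific auxiliary tasks and head designs are not reproduced here.

```python
# Minimal sketch: shared multilingual encoder with one classification head
# per task. Batches from different tasks/languages are interleaved in training.
import torch.nn as nn

class MultiTaskClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, hidden: int, task_labels: dict):
        super().__init__()
        self.encoder = encoder  # e.g. a multilingual BERT-style model (assumption)
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in task_labels.items()})

    def forward(self, task, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # CLS pooling
        return self.heads[task](cls)       # logits for the requested task
```

Training then sums or alternates the per-task cross-entropy losses, letting the auxiliary languages regularize the shared encoder.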
arXiv Detail & Related papers (2022-10-24T08:26:51Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretraining and fine-tuning stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
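The data side of a restore task can be sketched as dictionary-based corruption: swap some source tokens for translations, then train the seq2seq model to reconstruct the original. The lexicon and replacement rate below are illustrative assumptions; the paper's corruption scheme may differ.

```python
# Minimal sketch of code-switching corruption for a "restore" training pair.
import random

def code_switch(tokens: list, lexicon: dict, p: float = 0.15) -> list:
    """Randomly replace tokens with dictionary translations (corruption step)."""
    return [lexicon[t] if t in lexicon and random.random() < p else t for t in tokens]

# Restore-task pair: corrupted input -> original sentence as the target.
src = ["the", "cat", "sat", "on", "the", "mat"]
pair = (code_switch(src, {"cat": "chat", "mat": "tapis"}), src)
```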
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- Cross-lingual Capsule Network for Hate Speech Detection in Social Media
We investigate the cross-lingual hate speech detection task, tackling the problem by adapting the hate speech resources from one language to another.
We propose a cross-lingual capsule network learning model coupled with extra domain-specific lexical semantics for hate speech.
Our model achieves state-of-the-art performance on benchmark datasets from AMI@Evalita 2018 and AMI@Ibereval 2018.
arXiv Detail & Related papers (2021-08-06T12:53:41Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
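To make "plugging in" cross-attention concrete, here is a hedged sketch of an encoder block that additionally attends to a parallel sentence in another language; VECO's exact wiring, dimensions, and training schedule may differ.

```python
# Minimal sketch: encoder block with an extra cross-attention sub-layer that
# conditions on representations of a parallel sentence in another language.
import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self.self_attn(x, x, x)[0])               # within-language
        return self.norm2(x + self.cross_attn(x, other, other)[0])   # across languages
```

Conditioning masked-word prediction on `other` is what prevents the model from relying only on monolingual context, which is the degeneration the summary refers to.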
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
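To show the general shape of such a regularizer, here is a minimal cosine-alignment penalty added to the task loss. Note the paper aligns representations without any external resource, so treat this purely as an illustration of the loss form, not of their pairing mechanism.

```python
# Illustrative alignment penalty: pull two (batch, d) representation sets
# together; added to the main task loss with a weighting coefficient.
import torch
import torch.nn.functional as F

def alignment_reg(h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
    return 1.0 - F.cosine_similarity(h_a, h_b, dim=-1).mean()
```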
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- On the Language Neutrality of Pre-trained Multilingual Representations
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
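As a hedged sketch of the kind of word-alignment procedure evaluated here, the snippet below greedily aligns a parallel sentence pair by cosine similarity of contextual embeddings; the paper's exact alignment method may differ.

```python
# Minimal sketch: greedy word alignment from contextual embeddings.
import numpy as np

def align_words(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> list:
    """src_vecs: (m, d), tgt_vecs: (n, d); returns (src_idx, tgt_idx) pairs."""
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = s @ t.T                                    # cosine similarity matrix
    return [(i, int(sim[i].argmax())) for i in range(sim.shape[0])]
```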
arXiv Detail & Related papers (2020-04-09T19:50:32Z)