Abusive Language Detection in Heterogeneous Contexts: Dataset Collection
and the Role of Supervised Attention
- URL: http://arxiv.org/abs/2105.11119v1
- Date: Mon, 24 May 2021 06:50:19 GMT
- Title: Abusive Language Detection in Heterogeneous Contexts: Dataset Collection
and the Role of Supervised Attention
- Authors: Hongyu Gong, Alberto Valido, Katherine M. Ingram, Giulia Fanti, Suma
Bhat, Dorothy L. Espelage
- Abstract summary: Abusive language is a massive problem in online social platforms.
We provide an annotated dataset of abusive language in over 11,000 comments from YouTube.
We propose an algorithm that uses a supervised attention mechanism to detect and categorize abusive content.
- Score: 9.597481034467915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Abusive language is a massive problem in online social platforms. Existing
abusive language detection techniques are particularly ill-suited to comments
containing heterogeneous abusive language patterns, i.e., both abusive and
non-abusive parts. This is due in part to the lack of datasets that explicitly
annotate heterogeneity in abusive language. We tackle this challenge by
providing an annotated dataset of abusive language in over 11,000 comments from
YouTube. We account for heterogeneity in this dataset by separately annotating
both the comment as a whole and the individual sentences that comprise each
comment. We then propose an algorithm that uses a supervised attention
mechanism to detect and categorize abusive content using multi-task learning.
We empirically demonstrate the challenges of using traditional techniques on
heterogeneous content and the comparative gains in performance of the proposed
approach over state-of-the-art methods.
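The supervised-attention idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the shapes, and the KL-divergence supervision term (pushing attention toward sentences annotated as abusive) are all assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def supervised_attention_loss(sent_embs, sent_labels, comment_label,
                              w_att, w_clf, lam=0.5):
    """Multi-task objective: comment-level abuse classification plus a
    supervision term pushing attention toward abusive sentences.

    sent_embs:   (n_sentences, dim) sentence embeddings for one comment
    sent_labels: (n_sentences,) binary per-sentence abuse annotations
    comment_label: binary comment-level abuse annotation
    """
    alpha = softmax(sent_embs @ w_att)       # attention over sentences
    comment_vec = alpha @ sent_embs          # attention-weighted comment vector
    p = 1.0 / (1.0 + np.exp(-(comment_vec @ w_clf)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    clf_loss = -(comment_label * np.log(p) + (1 - comment_label) * np.log(1 - p))
    att_loss = 0.0
    if sent_labels.sum() > 0:
        # target distribution: uniform over the abusive sentences
        target = sent_labels / sent_labels.sum()
        att_loss = np.sum(target * np.log(np.clip(target, 1e-9, 1.0)
                                          / np.clip(alpha, 1e-9, 1.0)))  # KL(target || alpha)
    return clf_loss + lam * att_loss, alpha
```

The per-sentence annotations play a double role here: they supervise the attention weights directly (the multi-task part), and the learned attention in turn localizes which parts of a heterogeneous comment are abusive.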
Related papers
- Breaking the Silence Detecting and Mitigating Gendered Abuse in Hindi, Tamil, and Indian English Online Spaces [0.6543929004971272]
Team CNLP-NITS-PP developed an ensemble approach combining CNN and BiLSTM networks.
CNN captures localized features indicative of abusive language through its convolution filters applied on embedded input text.
BiLSTM analyzes this sequence for dependencies among words and phrases.
Validation scores showed strong performance across F1-measures, especially 0.84 for English.
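The CNN branch's "localized features through convolution filters" can be illustrated with a small sketch; the filter shapes and the tanh/max-over-time choices below are assumptions, not details from the paper.

```python
import numpy as np

def conv_max_pool(embedded, filters):
    """1-D convolution over token embeddings followed by max-over-time
    pooling, the way a CNN branch extracts localized n-gram cues.

    embedded: (seq_len, dim) token embeddings
    filters:  (n_filters, width, dim) convolution filters
    """
    n_filters, width, _ = filters.shape
    feats = np.full(n_filters, -np.inf)
    for i in range(embedded.shape[0] - width + 1):
        window = embedded[i:i + width]                 # one n-gram window
        acts = np.tanh(np.tensordot(filters, window,
                                    axes=([1, 2], [0, 1])))
        feats = np.maximum(feats, acts)                # max-over-time pooling
    return feats
```

Each filter fires on a short span of tokens regardless of position, which is why convolutions suit localized abusive phrases, while the BiLSTM handles longer-range dependencies among words and phrases.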
arXiv Detail & Related papers (2024-04-02T14:55:47Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have [58.23138483086277]
In this work we leverage datasets we already have, covering a wide range of tasks related to abusive language detection.
Our goal is to build models cheaply for a new target label set and/or language, using only a few training examples of the target domain.
Our experiments show that, using existing datasets and only a few shots of the target task, model performance improves both monolingually and across languages.
arXiv Detail & Related papers (2023-05-23T14:04:12Z)
- Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes to countering malicious information by developing multilingual tools to simulate and detect new methods of content moderation evasion.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
- Enriching Abusive Language Detection with Community Context [0.3708656266586145]
Use of pejorative expressions can be benign or actively empowering.
Models for abuse detection misclassify these expressions as derogatory and inadvertently censor productive conversations held by marginalized groups.
Our paper highlights how community context can improve classification outcomes in abusive language detection.
arXiv Detail & Related papers (2022-06-16T20:54:02Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Abuse is Contextual, What about NLP? The Role of Context in Abusive Language Annotation and Detection [2.793095554369281]
We investigate what happens when the hateful content of a message is judged also based on the context.
We first re-annotate part of a widely used dataset for abusive language detection in English in two conditions, i.e. with and without context.
arXiv Detail & Related papers (2021-03-27T14:31:52Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
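The self-attention mechanism this entry refers to can be sketched minimally; the single-head, unparameterized form below is an illustrative assumption, not the GATE architecture itself.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention: every token attends directly to
    every other token, so syntactically distant or long-range word pairs
    can interact without passing through the dependency tree.

    X: (n_tokens, dim) token representations
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                     # pairwise similarities
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # row-wise softmax
    return w @ X                                      # context-mixed representations
```

Because the attention matrix covers all token pairs, path length between any two words is one, which is exactly the property that compensates for the GCN's locality.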
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
- "To Target or Not to Target": Identification and Analysis of Abusive Text Using Ensemble of Classifiers [18.053219155702465]
We present an ensemble learning method to identify and analyze abusive and hateful content on social media platforms.
Our stacked ensemble comprises three machine learning models that capture different aspects of language and provide diverse and coherent insights about inappropriate language.
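The inference step of a stacked ensemble can be sketched as follows; the logistic meta-classifier and hand-supplied weights are illustrative assumptions, since the entry does not specify how the three base models are combined.

```python
import numpy as np

def stacked_predict(base_probs, meta_w, meta_b=0.0):
    """Logistic meta-classifier over base-model outputs.

    base_probs: (n_samples, n_models) abuse probabilities emitted by the
    base models; meta_w and meta_b stand in for the meta-classifier's
    learned parameters.
    """
    z = base_probs @ meta_w + meta_b
    return 1.0 / (1.0 + np.exp(-z))
```

In practice the meta-classifier is trained on held-out base-model predictions, which lets it weight each model by how reliable its view of the language is.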
arXiv Detail & Related papers (2020-06-05T06:59:22Z)
- Joint Modelling of Emotion and Abusive Language Detection [26.18171134454037]
We present the first joint model of emotion and abusive language detection, experimenting in a multi-task learning framework.
Our results demonstrate that incorporating affective features leads to significant improvements in abuse detection performance across datasets.
arXiv Detail & Related papers (2020-05-28T14:08:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.