The Unseen Targets of Hate -- A Systematic Review of Hateful Communication Datasets
- URL: http://arxiv.org/abs/2405.08562v1
- Date: Tue, 14 May 2024 12:50:33 GMT
- Title: The Unseen Targets of Hate -- A Systematic Review of Hateful Communication Datasets
- Authors: Zehui Yu, Indira Sen, Dennis Assenmacher, Mattia Samory, Leon Fröhling, Christina Dahn, Debora Nozza, Claudia Wagner
- Abstract summary: Machine learning tools can only be as capable as the quality of their training data allows.
We present a systematic review of the datasets for the automated detection of hateful communication introduced over the past decade.
We find, overall, a skewed representation of selected target identities and mismatches between the targets that research conceptualizes and ultimately includes in datasets.
- Score: 15.593796580973937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML)-based content moderation tools are essential to keep online spaces free from hateful communication. Yet, ML tools can only be as capable as the quality of the data they are trained on allows them. While there is increasing evidence that they underperform in detecting hateful communications directed towards specific identities and may discriminate against them, we know surprisingly little about the provenance of such bias. To fill this gap, we present a systematic review of the datasets for the automated detection of hateful communication introduced over the past decade, and unpack the quality of the datasets in terms of the identities that they embody: those of the targets of hateful communication that the data curators focused on, as well as those unintentionally included in the datasets. We find, overall, a skewed representation of selected target identities and mismatches between the targets that research conceptualizes and ultimately includes in datasets. Yet, by contextualizing these findings in the language and location of origin of the datasets, we highlight a positive trend towards the broadening and diversification of this research space.
Related papers
- An Ensemble Scheme for Proactive Dominant Data Migration of Pervasive Tasks at the Edge [5.4327243200369555]
We propose a scheme to be implemented by autonomous edge nodes concerning their identifications of the appropriate data to be migrated to particular locations within the infrastructure.
Our objective is to equip nodes with the capability to comprehend the access patterns relating to offloaded data-driven tasks.
It is evident that these tasks depend on the processing of data that is absent from the original hosting nodes.
To infer these data intervals, we utilize an ensemble approach that integrates a statistically oriented model and a machine learning framework.
arXiv Detail & Related papers (2024-10-12T19:09:16Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Decoupling the Class Label and the Target Concept in Machine Unlearning [81.69857244976123]
Machine unlearning aims to adjust a trained model to approximate a retrained one that excludes a portion of training data.
Previous studies showed that class-wise unlearning is successful in forgetting the knowledge of a target class.
We propose a general framework, namely, TARget-aware Forgetting (TARF).
arXiv Detail & Related papers (2024-06-12T14:53:30Z)
- Data Representativeness in Accessibility Datasets: A Meta-Analysis [7.6597163467929805]
We review datasets sourced by people with disabilities and older adults.
We find that accessibility datasets represent diverse ages, but have gender and race representation gaps.
We hope our effort expands the space of possibility for greater inclusion of marginalized communities in AI-infused systems.
arXiv Detail & Related papers (2022-07-16T23:32:19Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Building Inspection Toolkit: Unified Evaluation and Strong Baselines for Damage Recognition [0.0]
We introduce the building inspection toolkit -- bikit -- which acts as a simple to use data hub containing relevant open-source datasets in the field of damage recognition.
The datasets are enriched with evaluation splits and predefined metrics, suiting the specific task and their data distribution.
For the sake of compatibility and to motivate researchers in this domain, we also provide a leaderboard and the possibility to share model weights with the community.
arXiv Detail & Related papers (2022-02-14T20:05:59Z)
- Causal Scene BERT: Improving object detection by searching for challenging groups of data [125.40669814080047]
Computer vision applications rely on learning-based perception modules parameterized with neural networks for tasks like object detection.
These modules frequently have low expected error overall but high error on atypical groups of data due to biases inherent in the training process.
Our main contribution is a pseudo-automatic method to discover such groups in foresight by performing causal interventions on simulated scenes.
arXiv Detail & Related papers (2022-02-08T05:14:16Z)
- Ground-Truth, Whose Truth? -- Examining the Challenges with Annotating Toxic Text Datasets [26.486492641924226]
This study examines selected toxic text datasets with the goal of shedding light on some of the inherent issues.
We re-annotate samples from three toxic text datasets and find that a multi-label approach to annotating toxic text samples can help to improve dataset quality.
arXiv Detail & Related papers (2021-12-07T06:58:22Z)
- Training Dynamic based data filtering may not work for NLP datasets [0.0]
We study the applicability of the Area Under the Margin (AUM) metric to identify mislabelled examples in NLP datasets.
We find that mislabelled samples can be filtered using the AUM metric in NLP datasets but it also removes a significant number of correctly labeled points.
arXiv Detail & Related papers (2021-09-19T18:50:45Z)
- Towards Unbiased Visual Emotion Recognition via Causal Intervention [63.74095927462]
We propose a novel Interventional Emotion Recognition Network (IERN) to alleviate the negative effects of dataset bias.
A series of designed tests validate the effectiveness of IERN, and experiments on three emotion benchmarks demonstrate that IERN outperforms other state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-26T10:40:59Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)