A Dataset for the Detection of Dehumanizing Language
- URL: http://arxiv.org/abs/2402.08764v1
- Date: Tue, 13 Feb 2024 19:58:24 GMT
- Title: A Dataset for the Detection of Dehumanizing Language
- Authors: Paul Engelmann, Peter Brunsgaard Trolle, Christian Hardmeier
- Abstract summary: We present two data sets of dehumanizing text, a large, automatically collected corpus and a smaller, manually annotated data set.
Our methods give us a broad and varied amount of dehumanization data to work with, enabling further exploratory analysis and automatic classification of dehumanization patterns.
- Score: 3.2803526084968895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dehumanization is a mental process that enables the exclusion and ill
treatment of a group of people. In this paper, we present two data sets of
dehumanizing text, a large, automatically collected corpus and a smaller,
manually annotated data set. Both data sets include a combination of political
discourse and dialogue from movie subtitles. Our methods give us a broad and
varied amount of dehumanization data to work with, enabling further exploratory
analysis and automatic classification of dehumanization patterns. Both data
sets will be publicly released.
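The abstract above mentions automatic classification of dehumanization patterns from an annotated corpus. As a rough illustration only (not the authors' model), the sketch below fits a tiny multinomial Naive Bayes text classifier in pure Python; every string, label, and function name is a placeholder, and the real corpus is not reproduced here.

```python
import math
from collections import Counter

def train(texts, labels):
    """Fit a tiny multinomial Naive Bayes over lowercase unigrams."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, y in zip(texts, labels):
        word_counts[y].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the most likely label (0 or 1) under the fitted model."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logprob = None, -math.inf
    for y in (0, 1):
        logprob = math.log(class_counts[y] / total_docs)
        # Laplace (add-one) smoothing so unseen words get nonzero probability.
        denom = sum(word_counts[y].values()) + len(vocab)
        for token in text.lower().split():
            logprob += math.log((word_counts[y][token] + 1) / denom)
        if logprob > best_logprob:
            best_label, best_logprob = y, logprob
    return best_label

# Placeholder examples (1 = flagged span, 0 = neutral span).
texts = [
    "they are vermin",
    "those people are a plague",
    "the committee met to discuss the budget",
    "the subtitles were translated carefully",
]
labels = [1, 1, 0, 0]
model = train(texts, labels)
```

On these toy examples, `predict(model, "they are a plague")` leans toward class 1 because its unigrams occur only in the flagged training spans; a real classifier would need far richer features and data.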
Related papers
- Beyond Hate Speech: NLP's Challenges and Opportunities in Uncovering Dehumanizing Language [11.946719280041789]
This paper evaluates the performance of cutting-edge NLP models, including GPT-4, GPT-3.5, and LLAMA-2, in identifying dehumanizing language.
Our findings reveal that while these models demonstrate potential, achieving a 70% accuracy rate in distinguishing dehumanizing language from broader hate speech, they also display biases.
arXiv Detail & Related papers (2024-02-21T13:57:36Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- When Crowd Meets Persona: Creating a Large-Scale Open-Domain Persona Dialogue Corpus [13.051107304650627]
Building a natural language dataset requires caution, since word semantics are sensitive to subtle changes in the text and to how the annotated concept is defined.
In this study, we tackle these issues when creating a large-scale open-domain persona dialogue corpus.
arXiv Detail & Related papers (2023-04-01T16:10:36Z)
- A Comparative Study on Textual Saliency of Styles from Eye Tracking, Annotations, and Language Models [21.190423578990824]
We present eyeStyliency, an eye-tracking dataset for human processing of stylistic text.
We develop a variety of methods to derive style saliency scores over text using the collected eye dataset.
We find that while eye-tracking data is unique, it also intersects with both human annotations and model-based importance scores.
arXiv Detail & Related papers (2022-12-19T21:50:36Z)
- Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases [0.0]
We propose a method for training effective language-specific sentence encoders without manually labeled data.
Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora.
Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks.
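The entry above describes automatically constructing paraphrase pairs from sentence-aligned bilingual corpora. One common way to do this, assumed here for illustration rather than taken from the paper, is pivoting: two source-language sentences aligned to the same target-language sentence are treated as paraphrase candidates. The corpus and function below are placeholders.

```python
from collections import defaultdict
from itertools import combinations

def mine_paraphrases(aligned_pairs):
    """Pivot mining: group source sentences by their shared target sentence,
    then emit every pair within a group as a paraphrase candidate.
    aligned_pairs: iterable of (source_sentence, target_sentence)."""
    by_target = defaultdict(set)
    for src, tgt in aligned_pairs:
        by_target[tgt].add(src)
    pairs = []
    for sources in by_target.values():
        pairs.extend(combinations(sorted(sources), 2))
    return pairs

# Toy sentence-aligned English-German corpus (placeholder data).
corpus = [
    ("the meeting was cancelled", "das treffen wurde abgesagt"),
    ("the meeting was called off", "das treffen wurde abgesagt"),
    ("he reads a book", "er liest ein buch"),
]
paraphrases = mine_paraphrases(corpus)
```

Here the two English sentences sharing the German pivot "das treffen wurde abgesagt" form the single mined pair; in practice such candidates would still need filtering for alignment noise.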
arXiv Detail & Related papers (2022-07-26T09:08:56Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- Does Summary Evaluation Survive Translation to Other Languages? [0.0]
We translate an existing English summarization dataset, SummEval, into four different languages.
We analyze the scores from the automatic evaluation metrics in translated languages, as well as their correlation with human annotations in the source language.
arXiv Detail & Related papers (2021-09-16T17:35:01Z)
- Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment [50.15466026089435]
We present a novel peer-to-peer Hindi conversation dataset, Vyaktitv.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features for all participants, including income and cultural orientation.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
- Boosting Semantic Human Matting with Coarse Annotations [66.8725980604434]
A coarsely annotated human dataset is much easier to acquire and can be collected from public datasets.
A matting refinement network takes in the unified mask and the input image to predict the final alpha matte.
arXiv Detail & Related papers (2020-04-10T09:11:02Z)
- A Framework for the Computational Linguistic Analysis of Dehumanization [52.735780962665814]
We analyze discussions of LGBTQ people in the New York Times from 1986 to 2015.
We find increasingly humanizing descriptions of LGBTQ people over time.
The ability to analyze dehumanizing language at a large scale has implications for automatically detecting and understanding media bias as well as abusive language online.
arXiv Detail & Related papers (2020-03-06T03:02:12Z)
- Can x2vec Save Lives? Integrating Graph and Language Embeddings for Automatic Mental Health Classification [91.3755431537592]
I show how merging graph and language embedding models (metapath2vec and doc2vec) avoids resource limits.
When integrated, the two data sources produce highly accurate predictions (90% accuracy, with 10% false positives and 12% false negatives).
These results extend research on the importance of simultaneously analyzing behavior and language in massive networks.
arXiv Detail & Related papers (2020-01-04T20:56:21Z)
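The last entry above merges graph embeddings (metapath2vec) with language embeddings (doc2vec). A minimal sketch of one plausible integration step, assuming simple per-user vector concatenation; the user ids and vectors below are placeholders, not the paper's data or its exact method.

```python
def merge_embeddings(graph_emb, text_emb):
    """Concatenate graph and text vectors for users present in both views."""
    return {u: graph_emb[u] + text_emb[u] for u in graph_emb if u in text_emb}

# Placeholder per-user embeddings from the two views.
graph_emb = {"user_a": [0.1, 0.3], "user_b": [0.7, 0.2]}  # e.g. metapath2vec
text_emb = {"user_a": [0.5, 0.1], "user_c": [0.9, 0.4]}   # e.g. doc2vec
merged = merge_embeddings(graph_emb, text_emb)
```

Only users covered by both views survive the merge (here just `user_a`); the concatenated vectors could then feed any downstream classifier.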
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.