Developing a Multilingual Annotated Corpus of Misogyny and Aggression
- URL: http://arxiv.org/abs/2003.07428v1
- Date: Mon, 16 Mar 2020 20:19:21 GMT
- Title: Developing a Multilingual Annotated Corpus of Misogyny and Aggression
- Authors: Shiladitya Bhattacharya, Siddharth Singh, Ritesh Kumar, Akanksha
Bansal, Akash Bhagat, Yogesh Dawer, Bornini Lahiri, Atul Kr. Ojha
- Abstract summary: We discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla.
The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments.
- Score: 1.0187588674939276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we discuss the development of a multilingual annotated corpus
of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part
of a project on studying and automatically identifying misogyny and communalism
on social media (the ComMA Project). The dataset is collected from comments on
YouTube videos and currently contains a total of over 20,000 comments. The
comments are annotated at two levels: aggression (overtly aggressive, covertly
aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We
describe the process of data collection, the tagset used for annotation, and
issues and challenges faced during the process of annotation. Finally, we
discuss the results of the baseline experiments conducted to develop a
classifier for misogyny in the three languages.
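The abstract names the two annotation levels but not the architecture of the baseline classifiers. As an illustration only, the sketch below encodes the tagset as constants and trains a simple character n-gram TF-IDF plus logistic-regression baseline for the misogyny level; the pipeline, the toy comments, and the tag abbreviations are assumptions for the sketch, not the system reported in the paper.

```python
# Minimal sketch of the two-level annotation scheme and a baseline misogyny
# classifier. The tag values mirror the abstract's description; the
# TF-IDF + logistic-regression pipeline and toy data are illustrative
# assumptions, not the baseline reported in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Level 1: aggression (three-way); Level 2: misogyny (binary).
AGGRESSION_TAGS = ("OAG", "CAG", "NAG")  # overtly / covertly / non-aggressive
MISOGYNY_TAGS = ("GEN", "NGEN")          # gendered / non-gendered

# Hypothetical comments standing in for the YouTube data.
train_comments = [
    "She should never be allowed to speak in public",
    "Great analysis, thanks for sharing the video",
    "Women like her ruin everything",
    "I disagree with the speaker but respect the argument",
]
train_labels = ["GEN", "NGEN", "GEN", "NGEN"]

# Character n-grams are a common language-agnostic choice for code-mixed,
# multi-script text such as Hindi and Bangla comments.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(train_comments, train_labels)
print(baseline.predict(["What a thoughtful video essay"]))
```

A second classifier of the same shape could be trained on the three-way aggression tags, since the abstract treats the two levels as separate annotations.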
Related papers
- Beyond Binary Gender: Evaluating Gender-Inclusive Machine Translation with Ambiguous Attitude Words [85.48043537327258]
Existing machine translation gender bias evaluations are primarily focused on male and female genders.
This study presents the AmbGIMT benchmark (Gender-Inclusive Machine Translation with Ambiguous attitude words).
We propose a novel process to evaluate gender bias based on the Emotional Attitude Score (EAS), which is used to quantify ambiguous attitude words.
arXiv Detail & Related papers (2024-07-23T08:13:51Z)
- A multitask learning framework for leveraging subjectivity of annotators to identify misogyny [47.175010006458436]
We propose a multitask learning approach to enhance the performance of the misogyny identification systems.
We incorporated diverse perspectives from annotators in our model design, considering gender and age across six profile groups.
This research advances content moderation and highlights the importance of embracing diverse perspectives to build effective online moderation systems.
arXiv Detail & Related papers (2024-06-22T15:06:08Z)
- Exploratory Data Analysis on Code-mixed Misogynistic Comments [0.0]
We present a novel dataset of YouTube comments in code-mixed Hinglish.
These comments have been weakly labelled as 'Misogynistic' and 'Non-misogynistic'.
arXiv Detail & Related papers (2024-03-09T23:21:17Z)
- Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis [44.17106903728264]
Most hate speech datasets neglect the cultural diversity within a single language.
To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset.
Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate of 26%.
arXiv Detail & Related papers (2023-08-31T13:14:47Z)
- "I'm fully who I am": Towards Centering Transgender and Non-Binary Voices to Measure Biases in Open Language Generation [69.25368160338043]
Transgender and non-binary (TGNB) individuals disproportionately experience discrimination and exclusion from daily life.
We assess how the social reality surrounding experienced marginalization of TGNB persons contributes to and persists within Open Language Generation.
We introduce TANGO, a dataset of template-based real-world text curated from a TGNB-oriented community.
arXiv Detail & Related papers (2023-05-17T04:21:45Z)
- Hate Speech and Offensive Language Detection in Bengali [5.765076125746209]
We develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets.
We implement several baseline models for the classification of such hateful posts.
We also explore the interlingual transfer mechanism to boost classification performance.
arXiv Detail & Related papers (2022-10-07T12:06:04Z)
- Deep Multi-Task Models for Misogyny Identification and Categorization on Arabic Social Media [6.6410040715586005]
In this paper, we present the submitted systems to the first Arabic Misogyny Identification shared task.
We investigate three multi-task learning models as well as their single-task counterparts.
In order to encode the input text, our models rely on the pre-trained MARBERT language model (see the encoding sketch after this list).
arXiv Detail & Related papers (2022-06-16T18:54:37Z)
- The ComMA Dataset V0.2: Annotating Aggression and Bias in Multilingual Social Media Discourse [1.465840097113565]
We discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur.
The initial dataset consists of a total of 15,000 annotated comments in four languages.
As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English.
arXiv Detail & Related papers (2021-11-19T19:03:22Z)
- Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language [0.0]
We introduce Let-Mi, an Arabic Levantine Twitter dataset for misogynistic language and the first benchmark dataset for Arabic misogyny.
Let-Mi was used as an evaluation dataset for binary, multi-class, and target classification tasks conducted with several state-of-the-art machine learning systems.
arXiv Detail & Related papers (2021-03-18T12:01:13Z)
- Hostility Detection Dataset in Hindi [44.221862384125245]
We collect and manually annotate 8,200 online posts in Hindi.
The dataset uses multi-label tags due to significant overlap among the hostile classes.
arXiv Detail & Related papers (2020-11-06T20:33:12Z)
- A Framework for the Computational Linguistic Analysis of Dehumanization [52.735780962665814]
We analyze discussions of LGBTQ people in the New York Times from 1986 to 2015.
We find increasingly humanizing descriptions of LGBTQ people over time.
The ability to analyze dehumanizing language at a large scale has implications for automatically detecting and understanding media bias as well as abusive language online.
arXiv Detail & Related papers (2020-03-06T03:02:12Z)
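The Arabic misogyny entry above notes that input text is encoded with the pre-trained MARBERT language model. A minimal encoding sketch follows, assuming the publicly released UBC-NLP/MARBERT checkpoint on the Hugging Face Hub; the mean-pooling step and the toy input are illustrative choices, not details confirmed by that paper.

```python
# Minimal sketch: encoding text with MARBERT via Hugging Face Transformers.
# "UBC-NLP/MARBERT" is the public checkpoint name; mean pooling over
# non-padding tokens is an illustrative choice, not the paper's setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERT")
model = AutoModel.from_pretrained("UBC-NLP/MARBERT")

batch = tokenizer(
    ["نص تجريبي"],  # toy Arabic input ("sample text")
    padding=True,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one vector per comment.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # e.g. torch.Size([1, 768])
```

These sentence vectors would then feed the single-task or multi-task classification heads that the shared-task systems compare.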