The ComMA Dataset V0.2: Annotating Aggression and Bias in Multilingual
Social Media Discourse
- URL: http://arxiv.org/abs/2111.10390v1
- Date: Fri, 19 Nov 2021 19:03:22 GMT
- Title: The ComMA Dataset V0.2: Annotating Aggression and Bias in Multilingual
Social Media Discourse
- Authors: Ritesh Kumar and Enakshi Nandi and Laishram Niranjana Devi and Shyam
Ratan and Siddharth Singh and Akash Bhagat and Yogesh Dawer
- Abstract summary: We discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur.
The initial dataset consists of a total of 15,000 annotated comments in four languages.
As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English.
- Score: 1.465840097113565
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we discuss the development of a multilingual dataset annotated
with a hierarchical, fine-grained tagset marking different types of aggression
and the "context" in which they occur. The context, here, is defined by the
conversational thread in which a specific comment occurs and also the "type" of
discursive role that the comment is performing with respect to the previous
comment. The initial dataset discussed here (and made available as part
of the ComMA@ICON shared task) consists of a total of 15,000 annotated comments
in four languages - Meitei, Bangla, Hindi, and Indian English - collected from
various social media platforms such as YouTube, Facebook, Twitter and Telegram.
As is usual on social media websites, a large number of these comments are
multilingual, mostly code-mixed with English. The paper gives a detailed
description of the tagset being used for annotation and also the process of
developing a multi-label, fine-grained tagset that can be used for marking
comments with aggression and bias of various kinds including gender bias,
religious intolerance (called communal bias in the tagset), class/caste bias
and ethnic/racial bias. We also define and discuss the tags that have been used
for marking the different discursive roles being performed through the comments,
such as attack, defend, etc. We also present a statistical analysis of the
dataset as well as results of our baseline experiments with developing an
automatic aggression identification system using the dataset developed.
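The annotation scheme described in the abstract (multi-label bias tags, a discursive role, and thread context) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the dataset's actual schema: the class names, field names, and the `thread_context` helper are hypothetical, the aggression tag is reduced to a boolean placeholder because the abstract does not list the aggression labels, and only the bias types and roles named in the abstract are included.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Bias(Enum):
    # Bias types named in the abstract; identifiers are hypothetical.
    GENDER = "gender"
    COMMUNAL = "communal"  # the tagset's name for religious intolerance
    CLASS_CASTE = "class/caste"
    ETHNIC_RACIAL = "ethnic/racial"


class Role(Enum):
    # Only "attack" and "defend" are named in the abstract.
    ATTACK = "attack"
    DEFEND = "defend"


@dataclass
class Comment:
    comment_id: str
    text: str
    language: str
    parent_id: Optional[str] = None  # previous comment in the thread, if any
    aggressive: bool = False  # placeholder for the fine-grained aggression tag
    biases: set = field(default_factory=set)  # multi-label: zero or more Bias values
    role: Optional[Role] = None  # discursive role relative to the previous comment


def thread_context(comment, by_id):
    """Return the chain of ancestor comments, nearest first."""
    chain = []
    seen = {comment.comment_id}  # guards against cyclic parent links
    parent_id = comment.parent_id
    while parent_id is not None and parent_id in by_id and parent_id not in seen:
        parent = by_id[parent_id]
        chain.append(parent)
        seen.add(parent_id)
        parent_id = parent.parent_id
    return chain
```

A comment can carry several bias tags at once, which is what makes the scheme multi-label, while its role and its ancestors in the thread supply the "context" the abstract refers to.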
Related papers
- cantnlp@LT-EDI-2023: Homophobia/Transphobia Detection in Social Media
Comments using Spatio-Temporally Retrained Language Models [0.9012198585960441]
This paper describes our multiclass classification system developed as part of the LT-EDI@RANLP-2023 shared task.
We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions.
We developed the best performing seven-label classification system for Malayalam based on weighted macro averaged F1 score.
arXiv Detail & Related papers (2023-08-20T21:30:34Z)
- Subjective Crowd Disagreements for Subjective Data: Uncovering
Meaningful CrowdOpinion with Population-level Learning [8.530934084017966]
We introduce CrowdOpinion, an unsupervised learning approach that uses language features and label distributions to pool similar items into larger samples of label distributions.
We use five publicly available benchmark datasets (with varying levels of annotator disagreements) from social media.
We also experiment in the wild using a dataset from Facebook, where annotations come from the platform itself by users reacting to posts.
arXiv Detail & Related papers (2023-07-07T22:09:46Z)
- SentiGOLD: A Large Bangla Gold Standard Multi-Domain Sentiment Analysis
Dataset and its Evaluation [0.9894420655516565]
SentiGOLD adheres to established linguistic conventions agreed upon by the Government of Bangladesh and a Bangla linguistics committee.
The dataset incorporates data from online video comments, social media posts, blogs, news, and other sources while maintaining domain and class distribution rigorously.
The top model achieves a macro F1 score of 0.62 (intra-dataset) across 5 classes, setting a benchmark, and 0.61 (cross-dataset from SentNoB) across 3 classes, comparable to the state of the art.
arXiv Detail & Related papers (2023-06-09T12:07:10Z)
- Micro-video Tagging via Jointly Modeling Social Influence and Tag
Relation [56.23157334014773]
85.7% of micro-videos lack annotation.
Existing methods mostly focus on analyzing video content, neglecting users' social influence and tag relation.
We formulate micro-video tagging as a link prediction problem in a constructed heterogeneous network.
arXiv Detail & Related papers (2023-03-15T02:13:34Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- DravidianCodeMix: Sentiment Analysis and Offensive Language
Identification Dataset for Dravidian Languages in Code-Mixed Text [0.9738927161150494]
The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English.
The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff's alpha.
arXiv Detail & Related papers (2021-06-17T13:13:26Z)
- Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models on distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint
Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
- Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset
for Personality Assessment [50.15466026089435]
We present a novel peer-to-peer Hindi conversation dataset, Vyaktitv.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features, such as income and cultural orientation, among several others, for all the participants.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Developing a Multilingual Annotated Corpus of Misogyny and Aggression [1.0187588674939276]
We discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla.
The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments.
arXiv Detail & Related papers (2020-03-16T20:19:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.