Exploratory Data Analysis on Code-mixed Misogynistic Comments
- URL: http://arxiv.org/abs/2403.09709v1
- Date: Sat, 9 Mar 2024 23:21:17 GMT
- Title: Exploratory Data Analysis on Code-mixed Misogynistic Comments
- Authors: Sargam Yadav, Abhishek Kaushik, Kevin McDaid
- Abstract summary: We present a novel dataset of YouTube comments in code-mixed Hinglish.
These comments have been weakly labelled as 'Misogynistic' and 'Non-misogynistic'.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problems of online hate speech and cyberbullying have significantly worsened since the increase in popularity of social media platforms such as YouTube and Twitter (X). Natural Language Processing (NLP) techniques have proven to provide a great advantage in automatically filtering such toxic content. Women are disproportionately more likely to be victims of online abuse. However, there appears to be a lack of studies that tackle misogyny detection in under-resourced languages. In this short paper, we present a novel dataset of code-mixed Hinglish comments collected from YouTube videos, which have been weakly labelled as 'Misogynistic' and 'Non-misogynistic'. Pre-processing and Exploratory Data Analysis (EDA) techniques have been applied to the dataset to gain insights into its characteristics. The process has provided a better understanding of the dataset through sentiment scores, word clouds, etc.
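As a rough illustration of the kind of EDA the abstract mentions, the sketch below counts the most frequent tokens per weak label, which is the information a word cloud visualizes. It is not the authors' pipeline: the toy records, the `STOPWORDS` set, and the helper names `tokenize` and `top_words_by_label` are all assumptions made for this example, and the tokenizer only handles romanized (Latin-script) text.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy records standing in for weakly labelled comments.
# They are NOT taken from the paper's dataset.
comments = [
    ("Non-misogynistic", "Yeh video bahut accha hai!!"),
    ("Non-misogynistic", "Bahut accha explain kiya, thanks"),
    ("Misogynistic", "<abusive comment placeholder>"),
]

# Assumed stopword list for romanized Hinglish; a real list would be much larger.
STOPWORDS = {"hai", "yeh", "kiya"}


def tokenize(text):
    """Lowercase the text and keep runs of Latin letters only."""
    return re.findall(r"[a-z]+", text.lower())


def top_words_by_label(records, stopwords, k=3):
    """Return the k most frequent non-stopword tokens for each label."""
    counts = defaultdict(Counter)
    for label, text in records:
        counts[label].update(t for t in tokenize(text) if t not in stopwords)
    return {label: counter.most_common(k) for label, counter in counts.items()}


result = top_words_by_label(comments, STOPWORDS)
for label, words in result.items():
    print(label, words)
```

Comparing the per-label frequency tables side by side is one simple way to see which terms dominate each class before any classifier is trained.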
Related papers
- Breaking the Silence Detecting and Mitigating Gendered Abuse in Hindi, Tamil, and Indian English Online Spaces [0.6543929004971272]
Team CNLP-NITS-PP developed an ensemble approach combining CNN and BiLSTM networks.
CNN captures localized features indicative of abusive language through its convolution filters applied on embedded input text.
BiLSTM analyzes this sequence for dependencies among words and phrases.
Validation scores showed strong performance across F1-measures, especially for English (0.84).
arXiv Detail & Related papers (2024-04-02T14:55:47Z) - Anti-Sexism Alert System: Identification of Sexist Comments on Social Media Using AI Techniques [0.0]
Sexist comments that are publicly posted in social media (newspaper comments, social networks, etc.) usually obtain a lot of attention and become viral, with consequent damage to the persons involved.
In this paper, we introduce an anti-sexism alert system, based on natural language processing (NLP) and artificial intelligence (AI).
This system analyzes any public post and decides whether it could be considered a sexist comment.
arXiv Detail & Related papers (2023-11-28T19:48:46Z) - Subtle Misogyny Detection and Mitigation: An Expert-Annotated Dataset [5.528106559459623]
The Biasly dataset is built in collaboration with multi-disciplinary experts and annotators themselves.
The dataset can be used for a range of NLP tasks, including classification, severity score regression, and text generation for rewrites.
arXiv Detail & Related papers (2023-11-15T23:27:19Z) - Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z) - Understanding writing style in social media with a supervised contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 10^6 authored texts.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
arXiv Detail & Related papers (2023-10-17T09:01:17Z) - Topological Data Mapping of Online Hate Speech, Misinformation, and General Mental Health: A Large Language Model Based Study [6.803493330690884]
Recent advances in machine learning and large language models have made such an analysis possible.
In this study, we collected thousands of posts from carefully selected communities on the social media site Reddit.
We performed various machine-learning classifications based on embeddings in order to understand the role of hate speech/misinformation in various communities.
arXiv Detail & Related papers (2023-09-22T15:10:36Z) - CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z) - Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z) - Trawling for Trolling: A Dataset [56.1778095945542]
We present a dataset that models trolling as a subcategory of offensive content.
The dataset has 12,490 samples, split across 5 classes: Normal, Profanity, Trolling, Derogatory and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z) - Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media during the COVID-19 Crisis [51.39895377836919]
COVID-19 has sparked racism and hate on social media targeted towards Asian communities.
We study the evolution and spread of anti-Asian hate speech through the lens of Twitter.
We create COVID-HATE, the largest dataset of anti-Asian hate and counterspeech spanning 14 months.
arXiv Detail & Related papers (2020-05-25T21:58:09Z) - Developing a Multilingual Annotated Corpus of Misogyny and Aggression [1.0187588674939276]
We discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla.
The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments.
arXiv Detail & Related papers (2020-03-16T20:19:21Z)