HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse
- URL: http://arxiv.org/abs/2603.02684v1
- Date: Tue, 03 Mar 2026 07:26:53 GMT
- Title: HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse
- Authors: Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya, Md. Shad Akhtar
- Abstract summary: We present HateMirage, a novel dataset of Faux Hate comments. Each comment is annotated along three interpretable dimensions: Target, Intent, and Implication. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence.
- Score: 12.969019733646414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity, underrepresenting the nuanced ways misinformation can incite or normalize hate. To address this gap, we present HateMirage, a novel dataset of Faux Hate comments designed to advance reasoning and explainability research on hate emerging from fake or distorted narratives. The dataset was constructed by identifying widely debunked misinformation claims from fact-checking sources and tracing related YouTube discussions, resulting in 4,530 user comments. Each comment is annotated along three interpretable dimensions: Target (who is affected), Intent (the underlying motivation or goal behind the comment), and Implication (its potential social impact). Unlike prior explainability datasets such as HateXplain and HARE, which offer token-level or single-dimensional reasoning, HateMirage introduces a multi-dimensional explanation framework that captures the interplay between misinformation, harm, and social consequence. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence. Results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data than on model scale alone. By coupling misinformation reasoning with harm attribution, HateMirage establishes a new benchmark for interpretable hate detection and responsible AI research.
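The ROUGE-L F1 score used in the benchmark is derived from the longest common subsequence (LCS) between a reference explanation and a model-generated one. A minimal pure-Python sketch of the metric (the paper presumably relies on a standard library implementation; whitespace tokenization and lowercasing here are simplifying assumptions):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            # extend the LCS on a match, otherwise carry the best prefix score
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Sentence-BERT similarity, the second reported metric, would instead compare dense sentence embeddings (e.g. by cosine similarity), capturing semantic overlap that the LCS-based surface match misses.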
Related papers
- Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning [79.95774256444956]
The lack of reasoning capabilities in Vision-Language Models has remained at the forefront of research discourse. We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics.
arXiv Detail & Related papers (2026-02-26T18:54:06Z) - Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models [65.23999399834638]
We introduce DeceptionDecoded, a benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles. The dataset captures both misleading and non-misleading cases, spanning manipulations across visual and textual modalities. It supports three intent-centric tasks: misleading intent detection, misleading source attribution, and creator desire inference.
arXiv Detail & Related papers (2025-05-21T13:14:32Z) - Missci: Reconstructing Fallacies in Misrepresented Science [84.32990746227385]
Health-related misinformation on social networks can lead to poor decision-making and real-world dangers.
Missci is a novel argumentation theoretical model for fallacious reasoning.
We present Missci as a dataset to test the critical reasoning abilities of large language models.
arXiv Detail & Related papers (2024-06-05T12:11:10Z) - Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z) - HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning [29.519687405350304]
We introduce a hate speech detection framework, HARE, which harnesses the reasoning capabilities of large language models (LLMs) to fill gaps in explanations of hate speech.
Experiments on SBIC and Implicit Hate benchmarks show that our method, using model-generated data, consistently outperforms baselines.
Our method enhances the explanation quality of trained models and improves generalization to unseen datasets.
arXiv Detail & Related papers (2023-11-01T06:09:54Z) - Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment [26.504056750529124]
We present GOTHate, a large-scale code-mixed crowdsourced dataset of around 51k posts for hate speech detection from Twitter.
We benchmark it with 10 recent baselines and investigate how adding endogenous signals enhances the hate speech detection task.
Our solution HEN-mBERT is a modular, multilingual, mixture-of-experts model that enriches the linguistic subspace with latent endogenous signals.
arXiv Detail & Related papers (2023-06-01T19:36:52Z) - CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - An Information Retrieval Approach to Building Datasets for Hate Speech Detection [3.587367153279349]
A common practice is to annotate only tweets containing known "hate words".
A second challenge is that definitions of hate speech tend to be highly variable and subjective.
Our key insight is that the rarity and subjectivity of hate speech are akin to that of relevance in information retrieval (IR).
arXiv Detail & Related papers (2021-06-17T19:25:39Z) - HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection [27.05719607624675]
We introduce HateXplain, the first benchmark hate speech dataset covering multiple aspects of the issue.
Each post in our dataset is annotated from three different perspectives.
We observe that models, which utilize the human rationales for training, perform better in reducing unintended bias towards target communities.
arXiv Detail & Related papers (2020-12-18T15:12:14Z) - Transfer Learning for Hate Speech Detection in Social Media [14.759208309842178]
This paper uses a transfer learning technique to leverage two independent datasets jointly.
We build an interpretable two-dimensional visualization tool of the constructed hate speech representation -- dubbed the Map of Hate.
We show that the joint representation boosts prediction performances when only a limited amount of supervision is available.
arXiv Detail & Related papers (2019-06-10T08:00:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.