Self-Explaining Hate Speech Detection with Moral Rationales
- URL: http://arxiv.org/abs/2601.03481v1
- Date: Wed, 07 Jan 2026 00:17:16 GMT
- Title: Self-Explaining Hate Speech Detection with Moral Rationales
- Authors: Francielle Vargas, Jackson Trager, Diego Alves, Surendrabikram Thapa, Matteo Guida, Berk Atil, Daryna Dementieva, Andrew Smart, Ameeta Agrawal,
- Abstract summary: We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns.
- Score: 11.165386773222934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) while substantially enhancing explanation faithfulness, increasing IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Although explanations become more concise, sufficiency improves (+2.3 pp) and fairness remains stable, indicating more faithful rationales without performance or bias trade-offs.
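The abstract states that SMRA folds moral rationale supervision for attention directly into the training objective. Below is a minimal sketch of what such an objective could look like, assuming a PyTorch transformer classifier whose token attention is pooled into one score per token and a binary moral-rationale mask per example; the KL-based alignment term, the names (`attention_scores`, `rationale_mask`, `lambda_align`), and the pooling are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a rationale-supervised attention objective (assumed formulation,
# not the exact SMRA loss): classification loss plus a term that pushes the
# model's token attention toward expert-annotated moral rationale spans.
import torch
import torch.nn.functional as F


def rationale_supervised_loss(logits, labels, attention_scores,
                              rationale_mask, attn_mask, lambda_align=1.0):
    """Combine classification and attention-rationale alignment losses.

    logits:           (batch, num_classes) classifier outputs
    labels:           (batch,) gold hate-speech labels
    attention_scores: (batch, seq_len) unnormalized pooled token attention
    rationale_mask:   (batch, seq_len) 1 for annotated moral rationale tokens
    attn_mask:        (batch, seq_len) 1 for real tokens, 0 for padding
    """
    # Standard supervised classification term.
    cls_loss = F.cross_entropy(logits, labels)

    # Turn the binary rationale mask into a target distribution over
    # non-padding tokens.
    target = rationale_mask.float() * attn_mask.float()
    target = target / target.sum(dim=-1, keepdim=True).clamp(min=1e-8)

    # Normalize model attention over real tokens only.
    scores = attention_scores.masked_fill(attn_mask == 0, float("-inf"))
    attn = torch.softmax(scores, dim=-1)

    # KL(target || attention): move attention mass onto rationale tokens.
    align_loss = F.kl_div(attn.clamp(min=1e-8).log(), target,
                          reduction="batchmean")

    return cls_loss + lambda_align * align_loss
```

In this kind of setup the alignment weight (`lambda_align` here) trades off task accuracy against explanation faithfulness; the reported IoU F1 and Token F1 gains would then be measured by comparing the highest-attention tokens against the annotated rationale spans.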
Related papers
- Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z) - Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding [50.014363382140324]
Modality Importance (MI) is a simple yet effective mechanism for identifying the emotion-dominant modality. MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion. Results show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations.
arXiv Detail & Related papers (2025-12-02T12:29:41Z) - Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection [2.5432391525687748]
Supervised Rational Attention (SRA) is a framework that explicitly aligns model attention with human rationales. SRA improves both interpretability and fairness in hate speech classification.
arXiv Detail & Related papers (2025-11-10T12:57:56Z) - MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables [50.29407048003165]
We present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning.
arXiv Detail & Related papers (2025-09-15T19:06:10Z) - "Pull or Not to Pull?'': Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas [11.229443362516207]
This study presents a comprehensive empirical evaluation of 14 leading large language models (LLMs). We elicited 3,780 binary decisions and natural language justifications, enabling analysis along axes of decisional assertiveness, explanation-answer consistency, public moral alignment, and sensitivity to ethically irrelevant cues. We advocate for moral reasoning to become a primary axis in LLM alignment, calling for standardized benchmarks that evaluate not just what LLMs decide, but how and why.
arXiv Detail & Related papers (2025-08-10T10:45:16Z) - MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Multi-hop Hate Speech Explanation [6.477880844490245]
MFTCXplain is a benchmark dataset for evaluating the moral reasoning of Large Language Models. It comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Our results show a misalignment between LLM outputs and human annotations in moral reasoning tasks.
arXiv Detail & Related papers (2025-06-23T19:44:21Z) - Are Language Models Consequentialist or Deontological Moral Reasoners? [75.6788742799773]
We focus on a large-scale analysis of the moral reasoning traces provided by large language models (LLMs). We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology.
arXiv Detail & Related papers (2025-05-27T17:51:18Z) - MORALISE: A Structured Benchmark for Moral Alignment in Visual Language Models [38.0475868976819]
Vision-language models have demonstrated increasing influence in morally sensitive domains such as autonomous driving and medical analysis. We introduce MORALISE, a benchmark for evaluating the moral alignment of vision-language models using diverse, expert-verified real-world data.
arXiv Detail & Related papers (2025-05-20T01:11:17Z) - MoralBERT: A Fine-Tuned Language Model for Capturing Moral Values in Social Discussions [4.747987317906765]
Moral values play a fundamental role in how we evaluate information, make decisions, and form judgements around important social issues.
Recent advances in Natural Language Processing (NLP) show that moral values can be gauged in human-generated textual content.
This paper introduces MoralBERT, a range of language representation models fine-tuned to capture moral sentiment in social discourse.
arXiv Detail & Related papers (2024-03-12T14:12:59Z) - What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts and Rationales for Disambiguating Defeasible Social and Moral Situations [48.686872351114964]
Moral or ethical judgments rely heavily on the specific contexts in which they occur.
We introduce defeasible moral reasoning: a task to provide grounded contexts that make an action more or less morally acceptable.
We distill a high-quality dataset of 1.2M entries of contextualizations and rationales for 115K defeasible moral actions.
arXiv Detail & Related papers (2023-10-24T00:51:29Z) - Rethinking Machine Ethics -- Can LLMs Perform Moral Reasoning through the Lens of Moral Theories? [78.3738172874685]
Making moral judgments is an essential step toward developing ethical AI systems.
Prevalent approaches are mostly implemented in a bottom-up manner, which uses a large set of annotated data to train models based on crowd-sourced opinions about morality.
This work proposes a flexible top-down framework to steer (Large) Language Models (LMs) to perform moral reasoning with well-established moral theories from interdisciplinary research.
arXiv Detail & Related papers (2023-08-29T15:57:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.