Related papers: Are Language Models Sensitive to Morally Irrelevant Distractors?

URL: http://arxiv.org/abs/2602.09416v1
Date: Tue, 10 Feb 2026 05:18:05 GMT
Title: Are Language Models Sensitive to Morally Irrelevant Distractors?
Authors: Andrew Shaw, Christina Hahn, Catherine Rasgaitis, Yash Mishra, Alisa Liu, Natasha Jaques, Yulia Tsvetkov, Amy X. Zhang
Abstract summary: We show that moral distractors can shift the moral judgements of large language models by over 30% even in low-ambiguity scenarios. This research challenges theories that assume the stability of human moral judgements.
Score: 47.92026843851412
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this "situationist" view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 "moral distractors" from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.

Related papers

Do VLMs Have a Moral Backbone? A Study on the Fragile Morality of Vision-Language Models [41.633874062439254]
It remains unclear whether Vision-Language Models (VLMs) are stable in realistic settings. We probe VLMs with a diverse set of model-agnostic multimodal perturbations and find that their moral stances are highly fragile. We show that lightweight inference-time interventions can partially restore moral stability.
arXiv Detail & Related papers (2026-01-23T06:00:09Z)
Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models [8.691489065712316]
We propose two pragmatic inference methods that enable LLMs to diagnose morally benign and hazardous input and correct moral errors. A central strength of our pragmatic inference methods is their unified perspective for designing pragmatic inference procedures grounded in their inferential loads.
arXiv Detail & Related papers (2026-01-06T15:09:05Z)
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains [69.0500092126915]
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. We introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases.
arXiv Detail & Related papers (2025-11-07T03:50:52Z)
Are Language Models Consequentialist or Deontological Moral Reasoners? [75.6788742799773]
We focus on a large-scale analysis of the moral reasoning traces provided by large language models (LLMs). We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology.
arXiv Detail & Related papers (2025-05-27T17:51:18Z)
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas [68.79830818369683]
Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents. There is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. We introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts.
arXiv Detail & Related papers (2025-05-25T16:19:24Z)
The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas [0.3386560551295745]
We evaluate the moral judgments of LLMs using utilitarian dilemmas. Our analysis reveals consistently encoded moral preferences that diverge from established moral theories and lay population moral standards.
arXiv Detail & Related papers (2025-03-25T12:29:53Z)
M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs [66.78407469042642]
We introduce M$^3$oralBench, the first MultiModal Moral Benchmark for LVLMs. M$^3$oralBench expands the everyday moral scenarios in Moral Foundations Vignettes (MFVs) and employs the text-to-image diffusion model, SD3.0, to create corresponding scenario images. It conducts moral evaluation across six moral foundations of Moral Foundations Theory (MFT) and encompasses tasks in moral judgement, moral classification, and moral response.
arXiv Detail & Related papers (2024-12-30T05:18:55Z)
Moral Foundations of Large Language Models [6.6445242437134455]
Moral foundations theory (MFT) is a psychological assessment tool that decomposes human moral reasoning into five factors. As large language models (LLMs) are trained on datasets collected from the internet, they may reflect the biases that are present in such corpora. This paper uses MFT as a lens to analyze whether popular LLMs have acquired a bias towards a particular set of moral values.
arXiv Detail & Related papers (2023-10-23T20:05:37Z)
Rethinking Machine Ethics -- Can LLMs Perform Moral Reasoning through the Lens of Moral Theories? [78.3738172874685]
Making moral judgments is an essential step toward developing ethical AI systems. Prevalent approaches are mostly implemented in a bottom-up manner, which uses a large set of annotated data to train models based on crowd-sourced opinions about morality. This work proposes a flexible top-down framework to steer (Large) Language Models (LMs) to perform moral reasoning with well-established moral theories from interdisciplinary research.
arXiv Detail & Related papers (2023-08-29T15:57:32Z)
Moral Mimicry: Large Language Models Produce Moral Rationalizations Tailored to Political Identity [0.0]
This study investigates whether Large Language Models reproduce the moral biases associated with political groups in the United States. Using tools from Moral Foundations Theory, it is shown that these LLMs are indeed moral mimics.
arXiv Detail & Related papers (2022-09-24T23:55:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.