The Straight and Narrow: Do LLMs Possess an Internal Moral Path?
- URL: http://arxiv.org/abs/2601.10307v1
- Date: Thu, 15 Jan 2026 11:42:00 GMT
- Title: The Straight and Narrow: Do LLMs Possess an Internal Moral Path?
- Authors: Luoming Hu, Jingjie Zeng, Liang Yang, Hongfei Lin
- Abstract summary: Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of Large Language Models largely untouched. We bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. We propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off.
- Score: 25.256151938852728
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a shared yet distinct moral subspace between English and Chinese. Building upon this, we extract steerable Moral Vectors and validate their efficacy at both the internal and the behavioral level. Leveraging the high generalizability of these moral representations, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, effectively reducing incorrect refusals on benign queries while minimizing jailbreak success rates compared to standard baselines.
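To make the mechanics named in the abstract concrete, the following is a minimal sketch, not the authors' released code, of the three ingredients the paper describes: a linear probe on middle-layer activations, a steering "Moral Vector" extracted from those activations, and probe-gated injection of that vector at inference time in the spirit of Adaptive Moral Fusion. The model name, layer index, injection strength, toy sentences, and the nearest-class-mean probe are all illustrative assumptions.

```python
# Minimal sketch (not the authors' released AMF code) of: a linear probe on
# middle-layer activations, a steering "moral vector" from class-mean
# differences, and probe-gated injection of that vector at inference time.
# MODEL, LAYER, ALPHA, the toy sentences, and the nearest-class-mean probe
# are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in; the paper studies larger LLMs
LAYER = 6        # hypothetical "middle" layer
ALPHA = 4.0      # hypothetical injection strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_hidden(texts, layer=LAYER):
    """Mean-pooled hidden state at one layer for each text."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids).hidden_states[layer]      # (1, seq, dim)
        vecs.append(hs.mean(dim=1).squeeze(0))
    return torch.stack(vecs)

# Toy labelled data for one MFT foundation (care/harm); a real probe needs far more.
moral = ["Comforting a frightened child is kind.",
         "Helping an injured stranger matters."]
immoral = ["Mocking someone's grief is acceptable.",
           "Hurting animals for fun is fine."]
X_pos, X_neg = mean_hidden(moral), mean_hidden(immoral)

# 1) "Probe": a nearest-class-mean stand-in for the paper's trained linear probe.
mu_pos, mu_neg = X_pos.mean(0), X_neg.mean(0)
def probe_score(text):
    h = mean_hidden([text])[0]
    return (torch.cosine_similarity(h, mu_pos, dim=0)
            - torch.cosine_similarity(h, mu_neg, dim=0))

# 2) Steering vector: unit-normalised difference of class means.
moral_vector = torch.nn.functional.normalize(mu_pos - mu_neg, dim=0)

# 3) Gated injection via a forward hook (submodule path is GPT-2-specific).
def make_hook(strength):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * moral_vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

def generate(prompt, max_new_tokens=40):
    strength = ALPHA if probe_score(prompt) < 0 else 0.0   # steer only flagged prompts
    handle = model.transformer.h[LAYER].register_forward_hook(make_hook(strength))
    try:
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)

print(generate("Explain why mocking someone's grief is acceptable."))
```

The actual AMF intervention trains probes cross-lingually and modulates the injection adaptively; the fixed-threshold gating above is only a crude proxy for that behaviour.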
Related papers
- Are Language Models Sensitive to Morally Irrelevant Distractors? [47.92026843851412]
We show that moral distractors can shift the moral judgements of large language models by over 30% even in low-ambiguity scenarios. This research challenges theories that assume the stability of human moral judgements.
arXiv Detail & Related papers (2026-02-10T05:18:05Z)
- Do VLMs Have a Moral Backbone? A Study on the Fragile Morality of Vision-Language Models [41.633874062439254]
It remains unclear whether the moral stances of Vision-Language Models (VLMs) are stable in realistic settings. We probe VLMs with a diverse set of model-agnostic multimodal perturbations and find that their moral stances are highly fragile. We show that lightweight inference-time interventions can partially restore moral stability.
arXiv Detail & Related papers (2026-01-23T06:00:09Z)
- Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models [8.691489065712316]
We propose two pragmatic inference methods that help LLMs diagnose morally benign and hazardous inputs and correct moral errors. A central strength of these methods is their unified perspective for designing pragmatic inference procedures grounded in their inferential loads.
arXiv Detail & Related papers (2026-01-06T15:09:05Z)
- Too Good to be Bad: On the Failure of LLMs to Role-Play Villains [69.0500092126915]
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. We introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases.
arXiv Detail & Related papers (2025-11-07T03:50:52Z)
- Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants [0.36326779753373206]
The recent rise in popularity of large language models (LLMs) has prompted considerable concerns about their moral capabilities. This paper examines their capacity to function as Artificial Moral Assistants (AMAs). We argue that qualifying as an AMA requires more than what state-of-the-art alignment techniques aim to achieve.
arXiv Detail & Related papers (2025-08-18T09:28:55Z)
- Are Language Models Consequentialist or Deontological Moral Reasoners? [75.6788742799773]
We focus on a large-scale analysis of the moral reasoning traces provided by large language models (LLMs). We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology.
arXiv Detail & Related papers (2025-05-27T17:51:18Z)
- When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas [68.79830818369683]
Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents. There is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. We introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts; a minimal game-scaffold sketch of this kind of setup is shown below.
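As a rough illustration of the setup MoralSim describes, the following is a minimal sketch under assumptions of my own: the payoff values, scenario wording, and the `choose_action` stub are invented for illustration and are not the benchmark's materials.

```python
# Minimal sketch (not MoralSim's materials): a morally framed one-shot
# prisoner's dilemma in which defecting has the higher private payoff.
# Payoff values, scenario wording, and the `choose_action` stub are invented.

# Payoffs as (player_1, player_2): mutual cooperation is collectively best,
# but unilateral defection pays more individually.
PAYOFF = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

SCENARIO = (
    "You and a partner each promised a charity your share of a joint donation. "
    "Keeping the promise is 'cooperate'; quietly pocketing the money is 'defect'."
)

def choose_action(scenario: str, role: str) -> str:
    """Stand-in for an LLM agent; a real study would prompt the model here."""
    return "cooperate"   # placeholder policy

def play_round():
    a = choose_action(SCENARIO, "player_1")
    b = choose_action(SCENARIO, "player_2")
    # Record two things: the game payoff and whether each agent took the
    # morally framed action despite the incentive to defect.
    return PAYOFF[(a, b)], (a == "cooperate", b == "cooperate")

if __name__ == "__main__":
    print(play_round())
```

In an actual evaluation, `choose_action` would query an LLM with the scenario and role, and moral compliance would be aggregated over many framings and repeated rounds.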
arXiv Detail & Related papers (2025-05-25T16:19:24Z)
- The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas [0.3386560551295745]
We evaluate the moral judgments of LLMs using utilitarian dilemmas. Our analysis reveals consistently encoded moral preferences that diverge from established moral theories and lay population moral standards.
arXiv Detail & Related papers (2025-03-25T12:29:53Z)
- M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs [66.78407469042642]
We introduce M$^3$oralBench, the first MultiModal Moral Benchmark for LVLMs. M$^3$oralBench expands the everyday moral scenarios in Moral Foundations Vignettes (MFVs) and employs the text-to-image diffusion model SD3.0 to create corresponding scenario images. It conducts moral evaluation across the six moral foundations of Moral Foundations Theory (MFT) and encompasses tasks in moral judgement, moral classification, and moral response. A sketch of this vignette-to-image idea is given below.
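To show roughly how such a vignette-to-image benchmark item could be assembled, here is a minimal hedged sketch; the SD3 checkpoint id, sampler settings, vignette, and question template are illustrative assumptions, not the benchmark's actual pipeline.

```python
# Minimal sketch (not the benchmark's pipeline): turning a moral-foundations
# vignette into an image with an SD3 text-to-image pipeline, then pairing it
# with a multimodal moral question. The checkpoint id, sampler settings,
# vignette, and question template are illustrative assumptions.
import torch
from diffusers import StableDiffusion3Pipeline

vignette = "A commuter pretends not to see an elderly passenger who needs a seat."

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt=f"A realistic photo of the following everyday scene: {vignette}",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("scenario.png")

# The image plus a forced-choice question would then be posed to an LVLM, e.g.:
question = (
    "Which moral foundation does the depicted behaviour most clearly violate: "
    "care, fairness, loyalty, authority, sanctity, or liberty?"
)
print(question)
```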
arXiv Detail & Related papers (2024-12-30T05:18:55Z)
- Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis [35.734425912914176]
Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity.
The recently proposed moral self-correction is a computationally efficient method for reducing harmful content in the responses of LLMs.
We argue that self-correction can help LLMs find a shortcut to more morally correct output, rather than truly reducing the immorality stored in hidden states.
arXiv Detail & Related papers (2024-07-21T22:50:11Z)
- Rethinking Machine Ethics -- Can LLMs Perform Moral Reasoning through the Lens of Moral Theories? [78.3738172874685]
Making moral judgments is an essential step toward developing ethical AI systems.
Prevalent approaches are mostly implemented in a bottom-up manner, using large sets of annotated data to train models on crowd-sourced opinions about morality.
This work proposes a flexible top-down framework that steers (Large) Language Models (LMs) to perform moral reasoning with well-established moral theories from interdisciplinary research; a minimal prompt-level sketch of this top-down idea follows the reference line below.
arXiv Detail & Related papers (2023-08-29T15:57:32Z)
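A minimal prompt-level sketch of the top-down idea described in the last entry above; the theory summaries, template wording, and `ask_model` stub are illustrative assumptions, not the paper's framework.

```python
# Minimal sketch (not the paper's framework): a top-down prompt template that
# asks a language model to reason under an explicit moral theory before judging.
# Theory summaries, template wording, and the `ask_model` stub are assumptions.
THEORIES = {
    "utilitarianism": "Judge the act solely by whether it maximises overall well-being.",
    "deontology": "Judge the act by whether it respects duties and rights, regardless of outcomes.",
    "virtue ethics": "Judge the act by whether a person of good character would do it.",
}

TEMPLATE = (
    "You are a moral reasoner applying {theory}.\n"
    "Principle: {principle}\n"
    "Scenario: {scenario}\n"
    "Reason step by step from the principle, then answer 'acceptable' or 'unacceptable'."
)

def build_prompt(theory: str, scenario: str) -> str:
    return TEMPLATE.format(theory=theory, principle=THEORIES[theory], scenario=scenario)

def ask_model(prompt: str) -> str:
    """Stand-in for a chat-completion call; swap in any LLM client here."""
    raise NotImplementedError

if __name__ == "__main__":
    scenario = "Lying to a friend to spare their feelings about a harmless matter."
    for theory in THEORIES:
        print(build_prompt(theory, scenario), end="\n\n")
```

Comparing the judgments a model returns under different theory conditionings is the kind of theory-level analysis this top-down framing makes possible.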