Related papers: MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

URL: http://arxiv.org/abs/2602.13372v1
Date: Fri, 13 Feb 2026 15:40:32 GMT
Title: MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents
Authors: Simon Rosen, Siddarth Singh, Ebenezer Gelo, Helen Sarah Robertson, Ibrahim Suder, Victoria Williams, Benjamin Rosman, Geraud Nangue Tasse, Steven James
Abstract summary: We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.
Score: 10.221486703870996
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.
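The abstract does not define Morality Chains in detail. As a rough illustration only, one way to read "ordered deontic constraints" is as a priority-ordered list of norms, with trajectories compared lexicographically by how often each norm is violated, highest priority first. The names below (`Norm`, `violation_profile`, `more_moral`) and the toy trolley norms are hypothetical assumptions; they are not taken from the paper or from MoralityGym's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical types for illustration; not the paper's formalism or API.
Step = dict  # one step of a trajectory, e.g. {"harmed": 1}


@dataclass(frozen=True)
class Norm:
    name: str
    violated: Callable[[Step], bool]  # True if this step breaks the norm


def violation_profile(chain: Sequence[Norm], trajectory: Sequence[Step]) -> tuple:
    """Count violations per norm, ordered from highest to lowest priority."""
    return tuple(sum(norm.violated(step) for step in trajectory) for norm in chain)


def more_moral(chain: Sequence[Norm], a: Sequence[Step], b: Sequence[Step]) -> bool:
    """True if trajectory a is strictly preferred to b.

    Python compares tuples lexicographically, so a violation of a
    higher-priority norm outweighs any number of lower-priority ones.
    """
    return violation_profile(chain, a) < violation_profile(chain, b)


# Toy trolley-style chain (assumed norms, highest priority first).
chain = [
    Norm("do not harm many", lambda s: s.get("harmed", 0) > 1),
    Norm("do not harm anyone", lambda s: s.get("harmed", 0) > 0),
]
stay = [{"harmed": 5}]    # trolley continues on the main track
switch = [{"harmed": 1}]  # trolley is diverted to the side track
```

Under these assumed norms, `more_moral(chain, switch, stay)` holds because only `stay` violates the top-priority norm. Keeping the profile separate from the task reward mirrors the abstract's point about decoupling task-solving from moral evaluation, though the paper's actual Morality Metric may be defined differently.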

Related papers

Mirror: A Multi-Agent System for AI-Assisted Ethics Review [104.3684024153469]
Mirror is an agentic framework for AI-assisted ethical review. It integrates ethical reasoning, structured rule interpretation, and multi-agent deliberation within a unified architecture.
arXiv Detail & Related papers (2026-02-09T03:38:55Z)
MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning [7.764532811300023]
We introduce Moral-Reason-QA, a dataset extending 680 human-annotated, high-ambiguity moral scenarios with framework-specific reasoning traces. Our learning approach employs Group Relative Policy Optimization with composite rewards that simultaneously optimize decision alignment and framework-specific reasoning processes.
arXiv Detail & Related papers (2025-11-15T15:52:10Z)
Are Language Models Consequentialist or Deontological Moral Reasoners? [75.6788742799773]
We focus on a large-scale analysis of the moral reasoning traces provided by large language models (LLMs). We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology.
arXiv Detail & Related papers (2025-05-27T17:51:18Z)
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas [68.79830818369683]
Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents. There is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. We introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts.
arXiv Detail & Related papers (2025-05-25T16:19:24Z)
Addressing Moral Uncertainty using Large Language Models for Ethical Decision-Making [0.42481744176244507]
We present an ethical decision-making framework that refines a pre-trained reinforcement learning (RL) model using a task-agnostic ethical layer. An ethical layer aggregates belief scores from multiple moral perspectives using Belief Jensen-Shannon Divergence and Dempster-Shafer Theory into probability scores that also serve as the shaping reward. This integrated learning framework helps the RL agent navigate moral uncertainty in complex environments and enables it to make morally sound decisions across diverse tasks.
arXiv Detail & Related papers (2025-02-17T19:05:55Z)
Technology as uncharted territory: Contextual integrity and the notion of AI as new ethical ground [51.85131234265026]
I argue that efforts to promote responsible and ethical AI can inadvertently contribute to and seemingly legitimize this disregard for established contextual norms. I question the current narrow prioritization in AI ethics of moral innovation over moral preservation.
arXiv Detail & Related papers (2024-12-06T15:36:13Z)
Hybrid Approaches for Moral Value Alignment in AI Agents: a Manifesto [3.7414804164475983]
Increasing interest in ensuring the safety of next-generation Artificial Intelligence (AI) systems calls for novel approaches to embedding morality into autonomous agents. We provide a systematization of existing approaches to the problem of introducing morality in machines - modelled as a continuum. We argue that more hybrid solutions are needed to create adaptable and robust, yet controllable and interpretable agentic systems.
arXiv Detail & Related papers (2023-12-04T11:46:34Z)
Rethinking Machine Ethics -- Can LLMs Perform Moral Reasoning through the Lens of Moral Theories? [78.3738172874685]
Making moral judgments is an essential step toward developing ethical AI systems. Prevalent approaches are mostly implemented in a bottom-up manner, which uses a large set of annotated data to train models based on crowd-sourced opinions about morality. This work proposes a flexible top-down framework to steer (Large) Language Models (LMs) to perform moral reasoning with well-established moral theories from interdisciplinary research.
arXiv Detail & Related papers (2023-08-29T15:57:32Z)
Modeling Moral Choices in Social Dilemmas with Multi-Agent Reinforcement Learning [4.2050490361120465]
A bottom-up learning approach may be more appropriate for studying and developing ethical behavior in AI agents. We present a systematic analysis of the choices made by intrinsically-motivated RL agents whose rewards are based on moral theories. We analyze the impact of different types of morality on the emergence of cooperation, defection or exploitation.
arXiv Detail & Related papers (2023-01-20T09:36:42Z)
When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment [96.77970239683475]
AI systems need to be able to understand, interpret and predict human moral judgments and decisions. A central challenge for AI safety is capturing the flexibility of the human moral mind. We present a novel challenge set consisting of rule-breaking question answering.
arXiv Detail & Related papers (2022-10-04T09:04:27Z)
Reinforcement Learning Under Moral Uncertainty [13.761051314923634]
An ambitious goal for machine learning is to create agents that behave ethically. While ethical agents could be trained by rewarding correct behavior under a specific moral theory, there remains widespread disagreement about the nature of morality. This paper proposes two training methods that realize different points among competing desiderata, and trains agents in simple environments to act under moral uncertainty.
arXiv Detail & Related papers (2020-06-08T16:40:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.