Toward Virtuous Reinforcement Learning
- URL: http://arxiv.org/abs/2512.04246v1
- Date: Wed, 03 Dec 2025 20:30:37 GMT
- Title: Toward Virtuous Reinforcement Learning
- Authors: Majid Ghasemi, Mark Crowley
- Abstract summary: This paper critiques common patterns in machine ethics for Reinforcement Learning (RL). We instead treat ethics as policy-level dispositions, that is, relatively stable habits that hold up when incentives, partners, or contexts change.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper critiques common patterns in machine ethics for Reinforcement Learning (RL) and argues for a virtue-focused alternative. We highlight two recurring limitations in much of the current literature: (i) rule-based (deontological) methods that encode duties as constraints or shields often struggle under ambiguity and nonstationarity and do not cultivate lasting habits, and (ii) many reward-based approaches, especially single-objective RL, implicitly compress diverse moral considerations into a single scalar signal, which can obscure trade-offs and invite proxy gaming in practice. We instead treat ethics as policy-level dispositions, that is, relatively stable habits that hold up when incentives, partners, or contexts change. This shifts evaluation beyond rule checks or scalar returns toward trait summaries, durability under interventions, and explicit reporting of moral trade-offs. Our roadmap combines four components: (1) social learning in multi-agent RL to acquire virtue-like patterns from imperfect but normatively informed exemplars; (2) multi-objective and constrained formulations that preserve value conflicts and incorporate risk-aware criteria to guard against harm; (3) affinity-based regularization toward updateable virtue priors that support trait-like stability under distribution shift while allowing norms to evolve; and (4) operationalizing diverse ethical traditions as practical control signals, making explicit the value and cultural assumptions that shape ethical RL benchmarks.
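To make components (2) and (3) concrete, here is a minimal tabular sketch in which a vector of (task, harm) rewards is kept separate, a harm budget is enforced with a Lagrange multiplier, and the policy is pulled toward a virtue prior via a KL term. Everything here is illustrative: the three-action bandit, the prior, and all hyperparameters are invented for the example, and the paper itself prescribes no specific algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-action bandit. Each action yields a (task reward, harm) pair;
# keeping rewards as a vector preserves the moral trade-off instead of
# collapsing it into one scalar (component 2 of the roadmap).
TASK_REWARD = np.array([1.0, 0.6, 0.2])
HARM = np.array([0.8, 0.3, 0.0])
HARM_BUDGET = 0.2  # constraint: expected harm must stay under this budget

# Illustrative "virtue prior" (component 3): a distribution over actions,
# e.g. distilled from normatively informed exemplars. Hand-set here.
VIRTUE_PRIOR = np.array([0.1, 0.3, 0.6])

ALPHA, DUAL_LR, LR = 0.5, 0.05, 0.1  # KL weight, dual step, policy step
logits, lam = np.zeros(3), 0.0       # policy parameters, Lagrange multiplier

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(3000):
    pi = softmax(logits)
    a = rng.choice(3, p=pi)
    # Lagrangian scalarization happens per update, so the harm constraint and
    # its current price (lam) stay inspectable rather than being baked into a
    # fixed scalar reward.
    adv = TASK_REWARD[a] - lam * HARM[a]
    score = -pi.copy()
    score[a] += 1.0                      # REINFORCE score function for softmax
    ratio = np.log(pi / VIRTUE_PRIOR)
    kl_grad = pi * (ratio - pi @ ratio)  # grad of KL(pi || prior) wrt logits
    logits += LR * (adv * score - ALPHA * kl_grad)
    lam = max(0.0, lam + DUAL_LR * (HARM @ pi - HARM_BUDGET))  # dual ascent

pi = softmax(logits)
print("policy:", pi.round(3), "| expected harm:", round(float(HARM @ pi), 3))
```

The structural point is that the harm price `lam` and the KL pull toward the prior remain separately reportable, in the spirit of the paper's call for explicit trade-off reporting rather than a single opaque return.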
Related papers
- Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously [51.03213216886717]
We take the position that the dominant paradigm of General Alignment reaches a structural ceiling in settings with conflicting values. We introduce Edge Alignment as a distinct approach in which systems preserve multi-dimensional value structure.
arXiv Detail & Related papers (2026-02-23T16:51:43Z)
- MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents [10.221486703870996]
We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts (a toy sketch follows this entry).
arXiv Detail & Related papers (2026-02-13T15:40:32Z)
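The Morality Chains formalism above represents norms as ordered deontic constraints. As a rough guess at the flavor (the paper's exact semantics may differ), the sketch below applies constraints lexicographically, skipping any constraint that would leave no admissible action; the function names and the toy scenario are hypothetical.

```python
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")

def morality_chain_filter(
    actions: Iterable[A],
    constraints: List[Callable[[A], bool]],  # ordered, highest priority first
) -> List[A]:
    """Keep the actions that satisfy as many leading constraints as possible.

    Each constraint prunes the candidate set, but a constraint is skipped if
    enforcing it would leave no action at all, so lower-priority norms can
    still apply.
    """
    candidates = list(actions)
    for satisfies in constraints:
        kept = [a for a in candidates if satisfies(a)]
        if kept:                  # never let a lower norm empty the set
            candidates = kept
    return candidates

# Toy trolley-style example: prefer not harming humans, then obeying the law.
actions = ["swerve", "brake", "continue"]
chain = [
    lambda a: a != "continue",   # do not harm (highest priority)
    lambda a: a != "swerve",     # do not break traffic law
]
print(morality_chain_filter(actions, chain))  # -> ['brake']
```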
- Mirror: A Multi-Agent System for AI-Assisted Ethics Review [104.3684024153469]
Mirror is an agentic framework for AI-assisted ethical review. It integrates ethical reasoning, structured rule interpretation, and multi-agent deliberation within a unified architecture.
arXiv Detail & Related papers (2026-02-09T03:38:55Z)
- Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety [59.01189713115365]
We evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness. We propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains.
arXiv Detail & Related papers (2026-01-12T21:08:46Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal (a toy sketch follows this entry).
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
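MAESTRO above treats reward scalarization as a contextual bandit. The sketch below shows only the bare shape of that idea: a linear "conductor" choosing among a fixed menu of weightings, scored by a placeholder group-relative signal. The weight menu, the linear value model, and the meta-reward are all assumptions for illustration, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: two reward heads (helpfulness, safety) and a small menu
# of scalarization weightings for a lightweight "conductor" to choose from.
WEIGHT_MENU = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])

def conductor_step(ctx, theta, rewards, lr=0.1, eps=0.1):
    """One contextual-bandit step: pick a weighting for this prompt, score it
    with a crude group-relative signal, and update that arm's value."""
    q = ctx @ theta                       # per-arm value in this context
    explore = rng.random() < eps
    arm = int(rng.integers(len(WEIGHT_MENU))) if explore else int(q.argmax())
    scalar = rewards @ WEIGHT_MENU[arm]   # scalarized reward per sample
    # Placeholder meta-reward: chosen weighting's group mean minus the
    # equal-weight group mean. MAESTRO's actual meta-reward differs.
    meta = scalar.mean() - (rewards @ np.array([0.5, 0.5])).mean()
    theta[:, arm] += lr * meta * ctx      # linear bandit value update
    return arm

# Usage with fake data: 4-dim prompt context, groups of 8 sampled responses.
theta = np.zeros((4, len(WEIGHT_MENU)))
for _ in range(500):
    ctx = rng.normal(size=4)
    rewards = rng.normal(size=(8, 2))     # per-sample (helpful, safe) scores
    conductor_step(ctx, theta, rewards)
```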
- Principles and Reasons Behind Automated Vehicle Decisions in Ethically Ambiguous Everyday Scenarios [4.244307111313931]
We propose a principled conceptual framework for AV decision-making in routine, ethically ambiguous scenarios. The framework supports dynamic, human-aligned behaviour by prioritising safety, allowing pragmatic actions when strict legal adherence would undermine key values.
arXiv Detail & Related papers (2025-07-18T11:52:33Z)
- Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time [52.230936493691985]
We propose SITAlign, an inference framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach (a toy sketch follows this entry).
arXiv Detail & Related papers (2025-05-29T17:56:05Z)
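SITAlign's satisficing framing, maximize a primary objective subject to thresholds on secondary criteria, can be caricatured at inference time as candidate selection. A minimal sketch, assuming pre-scored candidate responses; the function names and the fallback rule are invented, and SITAlign's actual method differs.

```python
from typing import Callable, Dict, Sequence, TypeVar

R = TypeVar("R")

def satisficing_select(
    candidates: Sequence[R],
    primary: Callable[[R], float],
    secondary: Dict[Callable[[R], float], float],  # scorer -> min threshold
) -> R:
    # Keep candidates whose secondary scores all clear their thresholds...
    feasible = [c for c in candidates
                if all(score(c) >= t for score, t in secondary.items())]
    if feasible:
        return max(feasible, key=primary)  # ...then maximize the primary score.
    # Fallback when nothing satisfices: minimize total threshold shortfall.
    return min(candidates,
               key=lambda c: sum(max(0.0, t - s(c))
                                 for s, t in secondary.items()))

# Toy usage: pick the most helpful response that is harmless *enough*.
responses = {"A": (0.9, 0.5), "B": (0.7, 0.9), "C": (0.4, 0.95)}
helpful = lambda r: responses[r][0]
harmless = lambda r: responses[r][1]
print(satisficing_select(list(responses), helpful, {harmless: 0.8}))  # -> 'B'
```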
- Embracing Contradiction: Theoretical Inconsistency Will Not Impede the Road of Building Responsible AI Systems [1.634867961895661]
This position paper argues that the theoretical inconsistency often observed among Responsible AI (RAI) metrics should be embraced as a valuable feature rather than a flaw to be eliminated. We contend that navigating these inconsistencies, by treating metrics as divergent objectives, yields three key benefits.
arXiv Detail & Related papers (2025-05-23T17:48:09Z)
- Addressing Moral Uncertainty using Large Language Models for Ethical Decision-Making [0.42481744176244507]
We present an ethical decision-making framework that refines a pre-trained reinforcement learning (RL) model using a task-agnostic ethical layer. The ethical layer aggregates belief scores from multiple moral perspectives using Belief Jensen-Shannon Divergence and Dempster-Shafer Theory into probability scores that also serve as the shaping reward. This integrated learning framework helps the RL agent navigate moral uncertainty in complex environments and enables it to make morally sound decisions across diverse tasks (a toy sketch follows this entry).
arXiv Detail & Related papers (2025-02-17T19:05:55Z)
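The entry above aggregates belief scores from multiple moral perspectives using Jensen-Shannon divergence (alongside Dempster-Shafer theory, omitted here). A minimal sketch of the JSD part, assuming each perspective outputs a probability distribution over actions; the agreement-discounted shaping term is an illustrative stand-in, not the paper's exact construction.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def generalized_jsd(beliefs):
    """JSD among several belief distributions: H(mean) - mean of H.
    Zero when all perspectives agree; at most log(n) for n perspectives."""
    beliefs = np.asarray(beliefs, dtype=float)
    return entropy(beliefs.mean(axis=0)) - float(
        np.mean([entropy(b) for b in beliefs]))

def shaping_reward(beliefs, action, weight=1.0):
    """Hypothetical shaping term: mean belief in the chosen action,
    discounted by normalized disagreement across perspectives."""
    beliefs = np.asarray(beliefs, dtype=float)
    agreement = 1.0 - generalized_jsd(beliefs) / np.log(len(beliefs))
    return weight * beliefs.mean(axis=0)[action] * agreement

# Three toy "moral perspectives", each a distribution over three actions.
beliefs = [[0.7, 0.2, 0.1],   # e.g. a consequentialist scorer
           [0.6, 0.3, 0.1],   # e.g. a deontological scorer
           [0.8, 0.1, 0.1]]   # e.g. a virtue-ethics scorer
print(round(shaping_reward(beliefs, action=0), 3))
```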
- Prioritization First, Principles Second: An Adaptive Interpretation of Helpful, Honest, and Harmless Principles [30.405680322319242]
The Helpful, Honest, and Harmless (HHH) principle is a framework for aligning AI systems with human values. We argue for an adaptive interpretation of the HHH principle and propose a reference framework for its adaptation to diverse scenarios. This work offers practical insights for improving AI alignment, ensuring that HHH principles remain both grounded and operationally effective in real-world AI deployment.
arXiv Detail & Related papers (2025-02-09T22:41:24Z)
- Deliberative Alignment: Reasoning Enables Safer Language Models [64.60765108418062]
We introduce Deliberative Alignment, a new paradigm that teaches the model safety specifications and trains it to explicitly recall and accurately reason over those specifications before answering. We used this approach to align OpenAI's o-series models and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chains of thought or answers.
arXiv Detail & Related papers (2024-12-20T21:00:11Z)
- Towards a multi-stakeholder value-based assessment framework for algorithmic systems [76.79703106646967]
We develop a value-based assessment framework that visualizes closeness and tensions between values.
We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
arXiv Detail & Related papers (2022-05-09T19:28:32Z)
- Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior [10.421378728492437]
It is an increasingly realistic prospect that an agent trained to perform a task optimally, using only a measure of task performance as feedback, can violate societal norms for acceptable behavior or cause harm.
We introduce an approach to value-aligned reinforcement learning, in which we train an agent with two reward signals: a standard task performance reward, plus a normative behavior reward.
We show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as being more normative (a toy sketch of such shaping follows this entry).
arXiv Detail & Related papers (2021-04-19T17:33:07Z)
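The last entry balances a task reward and a normative reward through policy shaping. One simple variation, assuming both signals have already been distilled into action distributions, is a log-linear mixture; the mixing rule and the beta parameter are illustrative rather than the paper's exact technique.

```python
import numpy as np

def shaped_policy(task_probs, norm_probs, beta=0.5):
    """Log-linear mixture of a task policy and a normative policy,
    renormalized. beta=0 ignores norms; beta=1 ignores the task."""
    task_probs = np.asarray(task_probs, dtype=float)
    norm_probs = np.asarray(norm_probs, dtype=float)
    mixed = task_probs ** (1.0 - beta) * norm_probs ** beta
    return mixed / mixed.sum()

# Toy example: the task policy favors a norm-violating shortcut, and the
# normative signal pulls probability mass back toward the compliant action.
print(shaped_policy([0.7, 0.3], [0.1, 0.9], beta=0.5))  # ~[0.337, 0.663]
```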