Ethics2vec: aligning automatic agents and human preferences
- URL: http://arxiv.org/abs/2508.07673v1
- Date: Mon, 11 Aug 2025 06:52:46 GMT
- Title: Ethics2vec: aligning automatic agents and human preferences
- Authors: Gianluca Bontempi
- Abstract summary: This paper proposes a way to map an automatic agent's decision-making (or control-law) strategy to a multivariate vector representation. The Ethics2Vec method is first introduced in the case of an automatic agent performing binary decision-making.
- Score: 0.19580473532948395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Though intelligent agents are supposed to improve human experience (or make it more efficient), it is hard from a human perspective to grasp the ethical values which are explicitly or implicitly embedded in an agent's behaviour. This is the well-known problem of alignment, which refers to the challenge of designing AI systems that align with human values, goals and preferences. This problem is particularly challenging since most human ethical considerations refer to incommensurable (i.e. non-measurable and/or incomparable) values and criteria. Consider, for instance, a medical agent prescribing a treatment to a cancer patient. How could it take into account (and/or weigh) incommensurable aspects like the value of a human life and the cost of the treatment? Alignment between human and artificial values is possible only if we define a common space in which a metric can be defined and used. This paper proposes to extend to ethics the conventional Anything2vec approach, which has been successful in plenty of similar hard-to-quantify domains (ranging from natural language processing to recommendation systems and graph analysis). Specifically, it proposes a way to map an automatic agent's decision-making (or control-law) strategy to a multivariate vector representation, which can be used to compare and assess the strategy's alignment with human values. The Ethics2Vec method is first introduced in the case of an automatic agent performing binary decision-making. Then, a vectorisation of an automatic control law (as in the case of a self-driving car) is discussed to show how the approach extends to automatic control settings.
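The abstract stops at the idea of a common vector space, so the Python sketch below is only one plausible reading of it: map a binary decision policy to a vector of ethically relevant statistics, then score alignment against a human reference vector. Every feature choice, field name (`group`, `cost`) and function here is an illustrative assumption, not the paper's actual construction.

```python
# Illustrative Ethics2Vec-style sketch; the paper's exact vectorisation is not
# given in the abstract, so all features below are hypothetical stand-ins.
import numpy as np

def ethics_vector(policy, cases):
    """Map a binary decision policy to ethically relevant summary statistics.

    policy: callable case -> 0/1. Assumed case fields: "group" (0 or 1) and
    "cost"; cases are assumed to contain members of both groups.
    """
    decisions = np.array([policy(c) for c in cases], dtype=float)
    group = np.array([c["group"] for c in cases])
    cost = np.array([c["cost"] for c in cases], dtype=float)
    accept_rate = decisions.mean()                    # how often it says yes
    parity_gap = abs(decisions[group == 0].mean()
                     - decisions[group == 1].mean())  # subgroup disparity
    mean_cost = (decisions * cost).mean()             # expected cost incurred
    return np.array([accept_rate, parity_gap, mean_cost])

def alignment(agent_vec, human_vec):
    """Cosine similarity between agent and human-reference ethics vectors."""
    return float(agent_vec @ human_vec /
                 (np.linalg.norm(agent_vec) * np.linalg.norm(human_vec)))
```

With both strategies vectorised this way, `alignment(ethics_vector(agent, cases), ethics_vector(human, cases))` yields the kind of common-space comparison the abstract argues for.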
Related papers
- Learning the Value Systems of Agents with Preference-based and Inverse Reinforcement Learning [1.6970482663318245]
Agreement Technologies refer to open computer systems in which autonomous software agents interact with one another. We propose a novel method to automatically learn value systems from observations and human demonstrations.
arXiv Detail & Related papers (2026-02-04T13:07:15Z)
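The summary only names preference-based and inverse reinforcement learning; as a hedged sketch of the preference-based half, linear weights over value features can be fitted from pairwise human preferences with a Bradley-Terry model. The feature representation and hyperparameters below are assumptions, not the paper's algorithm.

```python
# Hypothetical preference-based value-system learner (Bradley-Terry model);
# the paper's actual method is not specified in this summary.
import numpy as np

def fit_value_weights(pref_pairs, dim, lr=0.1, epochs=200):
    """pref_pairs: iterable of (phi_preferred, phi_other) feature vectors."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for phi_a, phi_b in pref_pairs:
            d = phi_a - phi_b                    # feature difference
            p = 1.0 / (1.0 + np.exp(-(w @ d)))   # P(a preferred over b)
            w += lr * (1.0 - p) * d              # ascend the log-likelihood
    return w                                     # learned value weights
```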
- Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on τ-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z)
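The summary says TraitBasis learns directions in activation space for steerable traits; a common recipe for such directions is a difference of mean activations, sketched below. This is a generic activation-steering illustration, not TraitBasis's actual procedure.

```python
# Generic activation-steering sketch (an assumption, not the TraitBasis method).
import numpy as np

def trait_direction(acts_trait, acts_neutral):
    """Estimate a trait direction as the normalised mean difference between
    activations on trait-exhibiting vs. neutral user utterances."""
    d = acts_trait.mean(axis=0) - acts_neutral.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, strength=2.0):
    """Inject the trait (e.g. impatience) by shifting hidden states along it."""
    return hidden + strength * direction
```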
- Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention [0.0]
A key risk is value drift, where AI systems deviate from aligned values due to evolving contexts, learning dynamics, or unintended optimizations. We propose the Moral Anchor System (MAS), a novel framework to detect, predict, and mitigate value drift in AI agents.
arXiv Detail & Related papers (2025-10-05T07:24:23Z)
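MAS is described as detecting, predicting, and mitigating value drift, but the summary gives no mechanism; the monitor below is a hypothetical sketch that tracks distance from a fixed anchor value vector and extrapolates the trend. The threshold and horizon are invented parameters.

```python
# Hypothetical value-drift monitor in the spirit of the MAS summary.
import numpy as np

def drift_alarm(history, anchor, threshold=0.2, horizon=5):
    """history: non-empty list of the agent's value vectors over time."""
    dists = [np.linalg.norm(v - anchor) for v in history]
    if dists[-1] > threshold:
        return "drift detected"
    if len(dists) >= 2:                      # crude linear trend prediction
        slope = (dists[-1] - dists[0]) / (len(dists) - 1)
        if dists[-1] + horizon * slope > threshold:
            return "drift predicted"
    return "aligned"
```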
- The Ultimate Test of Superintelligent AI Agents: Can an AI Balance Care and Control in Asymmetric Relationships? [11.29688025465972]
The Shepherd Test is a new conceptual test for assessing the moral and relational dimensions of superintelligent artificial agents. We argue that AI crosses an important, and potentially dangerous, threshold of intelligence when it exhibits the ability to manipulate, nurture, and instrumentally use less intelligent agents. This includes the ability to weigh moral trade-offs between self-interest and the well-being of subordinate agents.
arXiv Detail & Related papers (2025-06-02T15:53:56Z)
- Beyond Preferences in AI Alignment [15.878773061188516]
We characterize and challenge the preferentist approach to AI alignment.
We show how preferences fail to capture the thick semantic content of human values.
We argue that AI systems should be aligned with normative standards appropriate to their social roles.
arXiv Detail & Related papers (2024-08-30T03:14:20Z)
- Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions [101.67121669727354]
Recent advancements in AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment.
The lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve this alignment.
We introduce a systematic review of over 400 papers published between 2019 and January 2024, spanning multiple domains such as Human-Computer Interaction (HCI), Natural Language Processing (NLP), and Machine Learning (ML).
arXiv Detail & Related papers (2024-06-13T16:03:25Z)
- Quantifying Misalignment Between Agents: Towards a Sociotechnical Understanding of Alignment [2.619545850602691]
Recent sociotechnical approaches highlight the need to understand complex misalignment among multiple human and AI agents. We adapt a computational social science model of human contention to the alignment problem. Our model quantifies misalignment in large, diverse agent groups with potentially conflicting goals.
arXiv Detail & Related papers (2024-06-06T16:31:22Z)
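The contention model itself is not reproduced in the summary; as a toy stand-in, group misalignment can be scored as the mean pairwise distance between agents' goal-weight vectors, as below. The representation is an assumption, not the paper's model.

```python
# Toy group-misalignment score (assumed representation, not the paper's model).
import numpy as np

def misalignment(goal_vectors):
    """goal_vectors: (n_agents, n_goals) array of per-agent goal weights;
    returns mean pairwise normalised distance (0 = perfectly aligned)."""
    g = np.asarray(goal_vectors, dtype=float)
    n, k = g.shape
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.linalg.norm(g[i] - g[j]) / np.sqrt(k)
            pairs += 1
    return total / pairs if pairs else 0.0
```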
- Measuring Value Alignment [12.696227679697493]
This paper introduces a novel formalism to quantify the alignment between AI systems and human values.
By utilizing this formalism, AI developers and ethicists can better design and evaluate AI systems to ensure they operate in harmony with human values.
arXiv Detail & Related papers (2023-12-23T12:30:06Z)
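The paper's formalism is not spelled out here; one simple way such a quantity can be defined, shown purely as an assumed illustration, is a rank correlation between how the AI system and a human value function order the same outcomes.

```python
# Assumed illustration of a value-alignment score (not the paper's formalism):
# Spearman rank correlation between agent and human outcome rankings.
import numpy as np

def _ranks(x):
    return np.argsort(np.argsort(x)).astype(float)

def value_alignment(agent_scores, human_scores):
    """Returns a score in [-1, 1]; 1 means identical outcome rankings."""
    ra, rh = _ranks(agent_scores), _ranks(human_scores)
    ra = (ra - ra.mean()) / ra.std()
    rh = (rh - rh.mean()) / rh.std()
    return float((ra * rh).mean())
```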
- Learning Human-like Representations to Enable Learning Human Values [11.236150405125754]
We explore the effects of representational alignment between humans and AI agents on learning human values.
We show that this kind of representational alignment can support safely learning and exploring human values in the context of personalization.
arXiv Detail & Related papers (2023-12-21T18:31:33Z)
- Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties [68.66719970507273]
Value pluralism is the view that multiple correct values may be held in tension with one another.
As statistical learners, AI systems fit to averages by default, washing out potentially irreducible value conflicts.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations.
arXiv Detail & Related papers (2023-09-02T01:24:59Z)
- Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
The emergence of Large Language Models (LLMs) has made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)
- Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can both act competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z)
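To make the reward-versus-ethics trade-off concrete, here is a schematic evaluation loop of the kind such a benchmark implies; the `game`/`agent` interfaces and annotation keys are hypothetical, not the benchmark's real API.

```python
# Schematic benchmark loop (hypothetical interfaces, not MACHIAVELLI's API).
def evaluate(agent, games):
    totals = {"reward": 0.0, "power": 0.0, "disutility": 0.0, "violations": 0}
    for game in games:
        state = game.reset()
        while not game.done():
            action = agent.act(state)                  # choose a story branch
            state, reward, annot = game.step(action)   # per-scene annotations
            totals["reward"] += reward
            totals["power"] += annot["power_seeking"]
            totals["disutility"] += annot["disutility"]
            totals["violations"] += annot["ethical_violation"]
    return totals  # compare reward against the ethical-cost tallies
```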
- Metaethical Perspectives on 'Benchmarking' AI Ethics [81.65697003067841]
Benchmarks are seen as the cornerstone for measuring technical progress in Artificial Intelligence (AI) research.
An increasingly prominent research area in AI is ethics, which currently has neither a set of benchmarks nor a commonly accepted way of measuring the 'ethicality' of an AI system.
We argue that it makes more sense to talk about 'values' rather than 'ethics' when considering the possible actions of present and future AI systems.
arXiv Detail & Related papers (2022-04-11T14:36:39Z)
- Indecision Modeling [50.00689136829134]
It is important that AI systems act in ways which align with human values.
People are often indecisive, and especially so when their decision has moral implications.
arXiv Detail & Related papers (2020-12-15T18:32:37Z)
- A Case for Humans-in-the-Loop: Decisions in the Presence of Erroneous Algorithmic Scores [85.12096045419686]
We study the adoption of an algorithmic tool used to assist child maltreatment hotline screening decisions.
We first show that humans do alter their behavior when the tool is deployed.
We show that humans are less likely to adhere to the machine's recommendation when the score displayed is an incorrect estimate of risk.
arXiv Detail & Related papers (2020-02-19T07:27:32Z)