Algorithms for learning value-aligned policies considering admissibility relaxation
- URL: http://arxiv.org/abs/2406.04838v1
- Date: Fri, 7 Jun 2024 11:10:07 GMT
- Title: Algorithms for learning value-aligned policies considering admissibility relaxation
- Authors: Andrés Holgado-Sánchez, Joaquín Arias, Holger Billhardt, Sascha Ossowski,
- Abstract summary: The emerging field of emphvalue awareness engineering claims that software agents and systems should be value-aware.
In this paper, we present two algorithms, $epsilontext-ADQL$ for strategies based on local alignment and its extension $epsilontext-CADQL$ for a sequence of decisions.
We have validated their efficiency in a water distribution problem in a drought scenario.
- Score: 1.8336820954218835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emerging field of \emph{value awareness engineering} claims that software agents and systems should be value-aware, i.e. they must make decisions in accordance with human values. In this context, such agents must be capable of explicitly reasoning as to how far different courses of action are aligned with these values. For this purpose, values are often modelled as preferences over states or actions, which are then aggregated to determine the sequences of actions that are maximally aligned with a certain value. Recently, additional value admissibility constraints at this level have been considered as well. However, often relaxed versions of these constraints are needed, and this increases considerably the complexity of computing value-aligned policies. To obtain efficient algorithms that make value-aligned decisions considering admissibility relaxation, we propose the use of learning techniques, in particular, we have used constrained reinforcement learning algorithms. In this paper, we present two algorithms, $\epsilon\text{-}ADQL$ for strategies based on local alignment and its extension $\epsilon\text{-}CADQL$ for a sequence of decisions. We have validated their efficiency in a water distribution problem in a drought scenario.
Related papers
- PAC Learnability of Scenario Decision-Making Algorithms: Necessary and Sufficient Conditions [0.7673339435080445]
PAC is the ability to make a decision that has an arbitrarily low risk of violating an unknown safety constraint.
Sufficient conditions for scenario decision-making algorithms to be PAC are available in the literature.
We derive a necessary condition for scenario decision-making algorithms to be PAC, inspired by the VC dimension and the so-called no-free-lunch theorem.
arXiv Detail & Related papers (2025-01-15T15:57:13Z) - Handling Cost and Constraints with Off-Policy Deep Reinforcement
Learning [2.793095554369282]
Most popular methods for off-policy learning include policy improvement steps where a learned state-action ($Q$) value function is maximized over selected batches of data.
We revisit this strategy in environments with "mixed-sign" reward functions.
We find that this second approach, when applied to continuous action spaces with mixed-sign rewards, consistently and significantly outperforms state-of-the-art methods augmented by resetting.
arXiv Detail & Related papers (2023-11-30T16:31:04Z) - Resilient Constrained Learning [94.27081585149836]
This paper presents a constrained learning approach that adapts the requirements while simultaneously solving the learning task.
We call this approach resilient constrained learning after the term used to describe ecological systems that adapt to disruptions by modifying their operation.
arXiv Detail & Related papers (2023-06-04T18:14:18Z) - Lexicographic Multi-Objective Reinforcement Learning [65.90380946224869]
We present a family of both action-value and policy gradient algorithms that can be used to solve such problems.
We show how our algorithms can be used to impose safety constraints on the behaviour of an agent, and compare their performance in this context with that of other constrained reinforcement learning algorithms.
arXiv Detail & Related papers (2022-12-28T10:22:36Z) - Safe Online Bid Optimization with Return-On-Investment and Budget
Constraints subject to Uncertainty [87.81197574939355]
We study the nature of both the optimization and learning problems.
We provide an algorithm, namely GCB, guaranteeing sublinear regret at the cost of a potentially linear number of constraints violations.
More interestingly, we provide an algorithm, namely GCB_safe(psi,phi), guaranteeing both sublinear pseudo-regret and safety w.h.p. at the cost of accepting tolerances psi and phi.
arXiv Detail & Related papers (2022-01-18T17:24:20Z) - Reinforcement Learning with Almost Sure Constraints [1.0323063834827415]
We argue that stationary policies are not sufficient for solving this problem.
We show that the minimal budget required to act safely can be obtained as the smallest fixed point of a Bellman-like operator.
arXiv Detail & Related papers (2021-12-09T20:07:53Z) - Implicitly Regularized RL with Implicit Q-Values [42.87920755961722]
The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms for which RL agents behave following a (soft)-greedy policy.
We propose to parametrize the $Q$-function implicitly, as the sum of a log-policy and of a value function.
We derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the $Q$-value.
arXiv Detail & Related papers (2021-08-16T12:20:47Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased navigation gradients of the value function which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic
Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z) - Average Reward Adjusted Discounted Reinforcement Learning:
Near-Blackwell-Optimal Policies for Real-World Applications [0.0]
Reinforcement learning aims at finding the best stationary policy for a given Markov Decision Process.
This paper provides deep theoretical insights to the widely applied standard discounted reinforcement learning framework.
We establish a novel near-Blackwell-optimal reinforcement learning algorithm.
arXiv Detail & Related papers (2020-04-02T08:05:18Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL)
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA)
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.