Value function interference and greedy action selection in value-based multi-objective reinforcement learning
- URL: http://arxiv.org/abs/2402.06266v1
- Date: Fri, 9 Feb 2024 09:28:01 GMT
- Title: Value function interference and greedy action selection in value-based multi-objective reinforcement learning
- Authors: Peter Vamplew, Cameron Foale, Richard Dazeley
- Abstract summary: Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to problems with multiple, conflicting objectives, represented by vector-valued rewards.
We show that, if the user's utility function maps widely varying vector-values to similar levels of utility, this can lead to interference in the value function learned by the agent.
We demonstrate empirically that avoiding the use of random tie-breaking when identifying greedy actions can ameliorate, but not fully overcome, the problems caused by value function interference.
- Score: 1.4206639868377509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-objective reinforcement learning (MORL) algorithms extend conventional
reinforcement learning (RL) to the more general case of problems with multiple,
conflicting objectives, represented by vector-valued rewards. Widely-used
scalar RL methods such as Q-learning can be modified to handle multiple
objectives by (1) learning vector-valued value functions, and (2) performing
action selection using a scalarisation or ordering operator which reflects the
user's utility with respect to the different objectives. However, as we
demonstrate here, if the user's utility function maps widely varying
vector-values to similar levels of utility, this can lead to interference in
the value-function learned by the agent, leading to convergence to sub-optimal
policies. This will be most prevalent in stochastic environments when
optimising for the Expected Scalarised Return criterion, but we present a
simple example showing that interference can also arise in deterministic
environments. We demonstrate empirically that avoiding the use of random
tie-breaking when identifying greedy actions can ameliorate, but not fully
overcome, the problems caused by value function interference.
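To make the mechanism concrete, here is a minimal Python sketch of scalarised greedy action selection over a vector-valued Q-table, with both tie-breaking schemes the abstract mentions. All names, shapes, and the example utility function are illustrative assumptions, not the paper's experimental setup.
```python
import numpy as np

# Hypothetical vector-valued Q-table: Q[state, action] holds one estimate
# per objective. Shapes and names are illustrative only.
N_STATES, N_ACTIONS, N_OBJECTIVES = 4, 3, 2
Q = np.zeros((N_STATES, N_ACTIONS, N_OBJECTIVES))

def utility(v):
    # Example non-linear utility: the user only cares about the worst objective.
    return min(v[0], v[1])

def greedy_action(q_state, rng=None):
    """Scalarise each action's vector value, then pick a greedy action.
    With rng, ties are broken at random; without it, the first maximal
    action is always chosen (deterministic tie-breaking)."""
    utilities = np.array([utility(q) for q in q_state])
    best = np.flatnonzero(utilities == utilities.max())
    return rng.choice(best) if rng is not None else best[0]

# Interference in miniature: two very different vector values that map
# to the same utility.
Q[0, 0] = [10.0, 0.0]
Q[0, 1] = [0.0, 10.0]
print(utility(Q[0, 0]), utility(Q[0, 1]))   # 0.0 0.0
print(utility((Q[0, 0] + Q[0, 1]) / 2))     # 5.0
print(greedy_action(Q[0]))                  # 0 (deterministic tie-break)
```
The last lines show the interference problem in miniature: blending two vector values that each have utility 0.0 yields a vector whose utility is 5.0, so updates that mix returns across equally-greedy actions can misstate the utility actually achievable, which is why the tie-breaking rule matters.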
Related papers
- Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Q-learning, StochDQN, and StochDDQN, all of which integrate this approach for both value-function updates and action selection (a simplified sketch follows this entry).
arXiv Detail & Related papers (2024-05-16T17:58:44Z)
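As a rough illustration of the subset idea in the Stochastic Q-learning entry above: score only roughly $\mathcal{O}(\log(n))$ randomly sampled actions (optionally plus a small memory of previously good ones) instead of all $n$. A hedged sketch, not the paper's exact construction; `stochastic_argmax` and its arguments are invented names.
```python
import math
import random

def stochastic_argmax(q_values, memory=(), rng=random):
    """Approximate the greedy action by scoring only a random subset of
    roughly O(log n) actions, optionally joined by a small 'memory' of
    previously good actions."""
    n = len(q_values)
    k = max(1, math.ceil(math.log2(n)))
    candidates = set(rng.sample(range(n), k)) | set(memory)
    return max(candidates, key=lambda a: q_values[a])

# With n = 10,000 actions, only ~14 are evaluated per call.
q = [random.random() for _ in range(10_000)]
print(stochastic_argmax(q, memory=[42]))
```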
- UCB-driven Utility Function Search for Multi-objective Reinforcement Learning [75.11267478778295]
In Multi-objective Reinforcement Learning (MORL), agents are tasked with optimising decision-making behaviours that trade off between multiple, possibly conflicting, objectives.
We focus on the case of linear utility functions parameterised by weight vectors $w$.
We introduce a method based on the Upper Confidence Bound (UCB) to efficiently search for the most promising weight vectors during different stages of the learning process (see the sketch after this entry).
arXiv Detail & Related papers (2024-05-01T09:34:42Z)
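The UCB-driven search above treats candidate weight vectors as bandit arms. Below is a generic UCB1 sketch of that idea; the class, its parameters, and the reward signal are illustrative assumptions rather than the paper's formulation.
```python
import numpy as np

class UCBWeightSearch:
    """UCB1 over a finite pool of candidate weight vectors w, each
    defining a linear utility u(v) = w . v. Names invented for
    illustration."""

    def __init__(self, weight_vectors, c=2.0):
        self.ws = weight_vectors
        self.counts = np.zeros(len(weight_vectors))
        self.means = np.zeros(len(weight_vectors))
        self.c = c  # exploration strength

    def select(self):
        """Return the index of the most promising weight vector."""
        untried = np.flatnonzero(self.counts == 0)
        if untried.size:                     # try every arm once first
            return int(untried[0])
        bonus = self.c * np.sqrt(np.log(self.counts.sum()) / self.counts)
        return int(np.argmax(self.means + bonus))

    def update(self, i, reward):
        """Feed back a scalar signal, e.g. utility achieved by the policy
        trained under weight vector i, after an evaluation phase."""
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]
```
A typical loop would call select(), train the MORL agent under the chosen $w$ for a while, then pass a scalar progress measure back through update().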
- Addressing the issue of stochastic environments and local decision-making in multi-objective reinforcement learning [0.0]
Multi-objective reinforcement learning (MORL) is a relatively new field which builds on conventional reinforcement learning (RL) to address problems with multiple objectives.
This thesis focuses on what factors influence the frequency with which value-based MORL Q-learning algorithms learn the optimal policy for an environment.
arXiv Detail & Related papers (2022-11-16T04:56:42Z)
- Meta-Wrapper: Differentiable Wrapping Operator for User Interest Selection in CTR Prediction [97.99938802797377]
Click-through rate (CTR) prediction, whose goal is to predict the probability that a user will click on an item, has become increasingly significant in recommender systems.
Recent deep learning models that automatically extract user interest from user behaviors have achieved great success.
We propose a novel approach under the framework of the wrapper method, which is named Meta-Wrapper.
arXiv Detail & Related papers (2022-06-28T03:28:15Z)
- Object-Aware Regularization for Addressing Causal Confusion in Imitation Learning [131.1852444489217]
This paper presents Object-aware REgularizatiOn (OREO), a technique that regularizes an imitation policy in an object-aware manner.
Our main idea is to encourage a policy to uniformly attend to all semantic objects, in order to prevent the policy from exploiting nuisance variables strongly correlated with expert actions.
arXiv Detail & Related papers (2021-10-27T01:56:23Z)
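The OREO entry above turns on encouraging uniform attention over semantic objects. One generic way to express that is a penalty on the divergence between an attention distribution and the uniform distribution, added to the imitation loss; this is a stand-in sketch of the idea, not the paper's exact mechanism.
```python
import numpy as np

def uniform_attention_penalty(attn, eps=1e-12):
    """attn: 1-D array of attention weights over semantic objects,
    summing to 1. KL(attn || uniform) is zero iff attention is evenly
    spread, so adding this penalty to an imitation loss pushes the
    policy away from latching onto a single nuisance object."""
    n = len(attn)
    return float(np.sum(attn * (np.log(attn + eps) + np.log(n))))

# Peaked attention is penalised; uniform attention is not.
print(uniform_attention_penalty(np.array([0.97, 0.01, 0.01, 0.01])))  # ~1.22
print(uniform_attention_penalty(np.array([0.25, 0.25, 0.25, 0.25])))  # ~0.0
```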
- Randomized Entity-wise Factorization for Multi-Agent Reinforcement Learning [59.62721526353915]
Multi-agent settings in the real world often involve tasks with varying types and quantities of agents and non-agent entities.
Our method aims to leverage these commonalities by asking: "What is the expected utility of each agent when only considering a randomly selected sub-group of its observed entities?" (see the sketch after this entry).
arXiv Detail & Related papers (2020-06-07T18:28:41Z)
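The question quoted in the entry above suggests a simple Monte-Carlo estimator: evaluate the agent's value while revealing only random sub-groups of its observed entities, then average. Everything here (function names, value_fn) is a placeholder sketch, not the paper's method.
```python
import random

def expected_subgroup_utility(agent_obs, entities, value_fn,
                              rng=None, n_samples=8):
    """Monte-Carlo estimate of an agent's expected utility when only a
    randomly selected sub-group of its observed entities is considered.
    value_fn stands in for any learned value network that scores
    (agent observation, entity subset) pairs."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        k = rng.randint(1, len(entities))
        subgroup = rng.sample(entities, k)   # reveal only these entities
        total += value_fn(agent_obs, subgroup)
    return total / n_samples
```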
- Dynamic Value Estimation for Single-Task Multi-Scene Reinforcement Learning [22.889059874754242]
Training deep reinforcement learning agents on environments with multiple levels, scenes, or conditions from the same task has become essential for many applications.
We propose a dynamic value estimation (DVE) technique for these multiple-MDP environments, motivated by the clustering effect observed in the value function distribution across different scenes.
arXiv Detail & Related papers (2020-05-25T17:56:08Z)
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task (see the sketch after this entry).
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
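The counterfactual-pair entry above can be illustrated with a gradient-alignment auxiliary loss: push the model's input-gradient at x toward the direction of its minimally-different partner x_cf. A sketch under the assumption that the input-gradient has been computed upstream by autodiff; it may differ from the paper's exact objective.
```python
import numpy as np

def gradient_supervision_loss(input_grad, x, x_cf, eps=1e-12):
    """Auxiliary loss aligning the model's input-gradient at x with the
    direction from x to its counterfactual partner x_cf (minimally
    different input, different label). Returns 1 - cosine similarity,
    so the loss is zero when the gradient points exactly along the
    difference that changes the label."""
    d = x_cf - x
    cos = np.dot(input_grad, d) / (
        np.linalg.norm(input_grad) * np.linalg.norm(d) + eps)
    return 1.0 - cos

# Toy usage with made-up vectors.
print(gradient_supervision_loss(np.array([1.0, 0.0]),
                                np.array([0.0, 0.0]),
                                np.array([2.0, 0.0])))  # 0.0 (aligned)
```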
- A utility-based analysis of equilibria in multi-objective normal form games [4.632366780742502]
We argue that compromises between competing objectives in multi-objective multi-agent systems (MOMAS) should be analysed on the basis of the utility that these compromises have for the users of a system.
This utility-based approach naturally leads to two different optimisation criteria for agents in a MOMAS.
We show that the choice of optimisation criterion can radically alter the set of equilibria in a multi-objective normal form game (MONFG) when non-linear utility functions are used (a toy calculation follows this entry).
arXiv Detail & Related papers (2020-01-17T22:27:38Z)
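The "two different optimisation criteria" above are commonly the scalarised expected return, u(E[V]), and the expected scalarised return, E[u(V)], the same ESR criterion named in the main abstract. A toy calculation with invented numbers shows how a non-linear utility makes them disagree:
```python
import numpy as np

u = lambda v: np.min(v)             # non-linear utility over a payoff vector
outcomes = np.array([[10.0, 0.0],   # possible vector payoffs under a
                     [0.0, 10.0]])  # mixed strategy (toy numbers)
p = np.array([0.5, 0.5])            # outcome probabilities

ser = u(p @ outcomes)                         # u(E[V]) = u([5, 5]) = 5.0
esr = p @ np.array([u(o) for o in outcomes])  # E[u(V)] = 0.5*0 + 0.5*0 = 0.0
print(ser, esr)                               # 5.0 0.0
```
Because an equilibrium depends on which of these quantities each agent optimises, swapping the criterion can change the equilibrium set.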
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.