Multi-Objective Coordination Graphs for the Expected Scalarised Returns
with Generative Flow Models
- URL: http://arxiv.org/abs/2207.00368v1
- Date: Fri, 1 Jul 2022 12:10:15 GMT
- Title: Multi-Objective Coordination Graphs for the Expected Scalarised Returns
with Generative Flow Models
- Authors: Conor F. Hayes and Timothy Verstraeten and Diederik M. Roijers and
Enda Howley and Patrick Mannion
- Abstract summary: Key to solving real-world problems is to exploit sparse dependency structures between agents.
In wind farm control a trade-off exists between maximising power and minimising stress on the system's components.
We model such sparse dependencies between agents as a multi-objective coordination graph (MO-CoG).
- Score: 2.7648976108201815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many real-world problems contain multiple objectives and agents, where a
trade-off exists between objectives. Key to solving such problems is to exploit
sparse dependency structures that exist between agents. For example, in wind
farm control a trade-off exists between maximising power and minimising stress
on the system's components. Dependencies between turbines arise due to the wake
effect. We model such sparse dependencies between agents as a multi-objective
coordination graph (MO-CoG). In multi-objective reinforcement learning a
utility function is typically used to model a user's preferences over
objectives, which may be unknown a priori. In such settings a set of optimal
policies must be computed. Which policies are optimal depends on which
optimality criterion applies. If the utility function of a user is derived from
multiple executions of a policy, the scalarised expected returns (SER) criterion must be
optimised. If the utility of a user is derived from a single execution of a
policy, the expected scalarised returns (ESR) criterion must be optimised. For
example, wind farms are subjected to constraints and regulations that must be
adhered to at all times, therefore the ESR criterion must be optimised. For
MO-CoGs, the state-of-the-art algorithms can only compute a set of optimal
policies for the SER criterion, leaving the ESR criterion understudied. To
compute a set of optimal policies under the ESR criterion, also known as the ESR
set, distributions over the returns must be maintained. Therefore, to compute a
set of optimal policies under the ESR criterion for MO-CoGs, we present a novel
distributional multi-objective variable elimination (DMOVE) algorithm. We
evaluate DMOVE in realistic wind farm simulations. Given the returns in
real-world wind farm settings are continuous, we utilise a model known as
real-NVP to learn the continuous return distributions to calculate the ESR set.
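For readers less familiar with the notation, the sketch below (standard multi-objective decision-making notation, assumed rather than copied from the paper) summarises the objects the abstract refers to: the factorisation of a MO-CoG's team payoff into local payoff functions, the SER and ESR optimality criteria, and the ESR-dominance relation that underlies the ESR set. Here $\mathbf{Z}^{\pi}$ denotes the vector-valued return of a single execution of joint policy $\pi$ and $u$ is the user's utility function.

```latex
% MO-CoG: the team's vector-valued payoff factorises into local payoff
% functions over small groups of agents (e.g. neighbouring turbines)
\mathbf{u}(\mathbf{a}) = \sum_{e \in \mathcal{E}} \mathbf{u}_{e}(\mathbf{a}_{e})

% SER: apply the utility function to the expected return
% (utility derived from multiple executions of a policy)
\pi^{*}_{\mathrm{SER}} = \arg\max_{\pi} \; u\!\left(\mathbb{E}\left[\mathbf{Z}^{\pi}\right]\right)

% ESR: take the expectation of the utility of the return
% (utility derived from a single execution of a policy)
\pi^{*}_{\mathrm{ESR}} = \arg\max_{\pi} \; \mathbb{E}\left[u\!\left(\mathbf{Z}^{\pi}\right)\right]

% ESR dominance, the relation from which the ESR set is built:
% \pi is preferred by every user with a monotonically increasing utility
\pi \succ_{\mathrm{ESR}} \pi'
\iff
\mathbb{E}\left[u\!\left(\mathbf{Z}^{\pi}\right)\right] \ge \mathbb{E}\left[u\!\left(\mathbf{Z}^{\pi'}\right)\right]
\;\; \text{for all monotonically increasing } u,
\text{ with strict inequality for at least one } u
```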
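DMOVE itself is not reproduced here, but the abstract describes it as a distributional extension of multi-objective variable elimination. As a point of reference, the following is a minimal sketch of plain scalar variable elimination on a coordination graph; the factor scopes, action sets, and the toy three-turbine example are illustrative assumptions, and DMOVE would replace the scalar `max` with operations over sets of return distributions pruned via ESR dominance.

```python
# Minimal sketch of scalar variable elimination on a coordination graph
# (a reference point for DMOVE, not the authors' algorithm).
from itertools import product

def variable_elimination(factors, actions, order):
    """factors: list of (scope, table); scope is a tuple of agent ids and
    table maps a tuple of their actions (in scope order) to a scalar payoff.
    actions: dict agent id -> list of actions. order: elimination order."""
    factors = list(factors)
    for agent in order:
        involved = [f for f in factors if agent in f[0]]
        factors = [f for f in factors if agent not in f[0]]
        # the new factor is defined over the eliminated agent's neighbours
        new_scope = tuple(sorted({a for scope, _ in involved for a in scope} - {agent}))
        new_table = {}
        for joint in product(*(actions[a] for a in new_scope)):
            ctx = dict(zip(new_scope, joint))
            best = float("-inf")
            for ai in actions[agent]:  # maximise out the eliminated agent
                assign = dict(ctx)
                assign[agent] = ai
                val = sum(table[tuple(assign[a] for a in scope)] for scope, table in involved)
                best = max(best, val)
            new_table[joint] = best
        factors.append((new_scope, new_table))
    # every agent eliminated: the remaining factors have empty scopes
    return sum(table[()] for _, table in factors)

# Toy 3-turbine chain: payoffs only between neighbouring turbines.
acts = {"t1": [0, 1], "t2": [0, 1], "t3": [0, 1]}
f12 = (("t1", "t2"), {(a, b): float(a == b) for a in (0, 1) for b in (0, 1)})
f23 = (("t2", "t3"), {(a, b): float(a != b) for a in (0, 1) for b in (0, 1)})
print(variable_elimination([f12, f23], acts, order=["t1", "t2", "t3"]))  # 2.0
```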
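Real-NVP is a normalising-flow model built from affine coupling layers, each invertible with a cheap log-determinant, which is what makes it suitable for learning the continuous return distributions mentioned above. The snippet below is an illustrative sketch of one such coupling layer for a two-objective return vector (e.g. power and stress); the network sizes, the dimension split, and the use of PyTorch are assumptions, not details taken from the paper.

```python
# Illustrative real-NVP-style affine coupling layer (a sketch, not the
# authors' implementation), for modelling a continuous return distribution.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.d = dim // 2  # first half of the return vector is passed through
        self.scale_net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(), nn.Linear(hidden, dim - self.d), nn.Tanh()
        )
        self.shift_net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(), nn.Linear(hidden, dim - self.d)
        )

    def forward(self, x):
        # x: batch of return vectors, one component per objective
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.scale_net(x1), self.shift_net(x1)
        y2 = x2 * torch.exp(s) + t      # affine transform of the second half
        log_det = s.sum(dim=1)          # log |det J| of the transformation
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.scale_net(y1), self.shift_net(y1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=1)

# Log-likelihood of a batch of observed two-objective returns under a
# standard-normal base distribution pushed through the coupling layer.
layer = AffineCoupling(dim=2)
returns = torch.randn(8, 2)  # placeholder return samples
z, log_det = layer(returns)
base = torch.distributions.Normal(0.0, 1.0)
log_prob = base.log_prob(z).sum(dim=1) + log_det
```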
Related papers
- Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
arXiv Detail & Related papers (2024-10-17T11:47:56Z)
- LLM-enhanced Reranking in Recommender Systems [49.969932092129305]
Reranking is a critical component in recommender systems, playing an essential role in refining the output of recommendation algorithms.
We introduce a comprehensive reranking framework, designed to seamlessly integrate various reranking criteria.
A customizable input mechanism is also integrated, enabling the tuning of the language model's focus to meet specific reranking needs.
arXiv Detail & Related papers (2024-06-18T09:29:18Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation, with performance that is similar to or stronger than PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor reparametric procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
arXiv Detail & Related papers (2023-06-07T23:55:12Z)
- Generative Slate Recommendation with Reinforcement Learning [49.75985313698214]
Reinforcement learning algorithms can be used to optimize user engagement in recommender systems.
However, RL approaches are intractable in the slate recommendation scenario.
In that setting, an action corresponds to a slate that may contain any combination of items.
In this work we propose to encode slates in a continuous, low-dimensional latent space learned by a variational auto-encoder.
We are able to (i) relax assumptions required by previous work, and (ii) improve the quality of the action selection by modeling full slates.
arXiv Detail & Related papers (2023-01-20T15:28:09Z)
- Machine Learning Simulates Agent-Based Model Towards Policy [0.0]
We use a random forest machine learning algorithm to emulate an agent-based model (ABM) and evaluate competing policies across 46 Metropolitan Regions (MRs) in Brazil.
As a result, we obtain the optimal (and non-optimal) performance of each region over the policies.
Results suggest that MRs already have embedded structures that favor optimal or non-optimal results, but they also illustrate which policy is more beneficial to each place.
arXiv Detail & Related papers (2022-03-04T21:19:11Z)
- Implicitly Regularized RL with Implicit Q-Values [42.87920755961722]
The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms, in which RL agents behave according to a (soft)-greedy policy.
We propose to parametrize the $Q$-function implicitly, as the sum of a log-policy and of a value function.
We derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the $Q$-value.
arXiv Detail & Related papers (2021-08-16T12:20:47Z)
- Expected Scalarised Returns Dominance: A New Solution Concept for Multi-Objective Decision Making [4.117597517886004]
In many real-world scenarios, the utility of a user is derived from a single execution of a policy.
To apply multi-objective reinforcement learning, the expected utility of the returns must be optimised.
We propose first-order dominance as a criterion to build solution sets to maximise expected utility.
We then define a new solution concept called the ESR set, which is a set of policies that are ESR dominant.
arXiv Detail & Related papers (2021-06-02T09:42:42Z)
- Implicit Distributional Reinforcement Learning [61.166030238490634]
An implicit distributional actor-critic (IDAC) is built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
- A utility-based analysis of equilibria in multi-objective normal form games [4.632366780742502]
We argue that compromises between competing objectives in MOMAS should be analysed on the basis of the utility that these compromises have for the users of a system.
This utility-based approach naturally leads to two different optimisation criteria for agents in a MOMAS.
We show that the choice of optimisation criterion can radically alter the set of equilibria in a MONFG when non-linear utility functions are used.
arXiv Detail & Related papers (2020-01-17T22:27:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.