Optimizing Return Distributions with Distributional Dynamic Programming
- URL: http://arxiv.org/abs/2501.13028v1
- Date: Wed, 22 Jan 2025 17:20:43 GMT
- Title: Optimizing Return Distributions with Distributional Dynamic Programming
- Authors: Bernardo Ávila Pires, Mark Rowland, Diana Borsa, Zhaohan Daniel Guo, Khimya Khetarpal, André Barreto, David Abel, Rémi Munos, Will Dabney
- Abstract summary: We introduce distributional dynamic programming (DP) methods for optimizing statistical functionals of the return distribution. To go beyond expected utilities, we combine distributional DP with stock augmentation, a technique previously introduced for classic DP in the context of risk-sensitive RL. We describe a number of applications outlining how to use distributional DP to solve different stock-augmented return distribution optimization problems.
- Score: 38.11199286025947
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce distributional dynamic programming (DP) methods for optimizing statistical functionals of the return distribution, with standard reinforcement learning as a special case. Previous distributional DP methods could optimize the same class of expected utilities as classic DP. To go beyond expected utilities, we combine distributional DP with stock augmentation, a technique previously introduced for classic DP in the context of risk-sensitive RL, where the MDP state is augmented with a statistic of the rewards obtained so far (since the first time step). We find that a number of recently studied problems can be formulated as stock-augmented return distribution optimization, and we show that we can use distributional DP to solve them. We analyze distributional value and policy iteration, with bounds and a study of what objectives these distributional DP methods can or cannot optimize. We describe a number of applications outlining how to use distributional DP to solve different stock-augmented return distribution optimization problems, for example maximizing conditional value-at-risk, and homeostatic regulation. To highlight the practical potential of stock-augmented return distribution optimization and distributional DP, we combine the core ideas of distributional value iteration with the deep RL agent DQN, and empirically evaluate it for solving instances of the applications discussed.
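To make the stock-augmentation idea concrete, a minimal sketch follows (illustrative only, not the authors' implementation; `env_step`, `make_stock_step`, and the choice of discounted-sum statistic are assumptions):

```python
import numpy as np

def make_stock_step(env_step, gamma=0.99):
    """Wrap a hypothetical `env_step(s, a) -> (s_next, r)` interface so the
    state carries a 'stock': here, the discounted sum of rewards obtained
    since the first time step."""
    def step(aug_state, action, t):
        s, stock = aug_state
        s_next, r = env_step(s, action)
        stock_next = stock + (gamma ** t) * r  # update the reward statistic
        return (s_next, stock_next), r
    return step

def cvar(return_samples, alpha=0.1):
    """Conditional value-at-risk: mean of the worst alpha-fraction of return
    samples, one functional a stock-augmented policy might maximize."""
    x = np.sort(np.asarray(return_samples))
    k = max(1, int(np.ceil(alpha * len(x))))
    return x[:k].mean()
```

A policy defined on the augmented pair (s, stock) can condition on how much return has already been accumulated, which is what allows objectives beyond expected utilities.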
Related papers
- A Differential Perspective on Distributional Reinforcement Learning [7.028778922533688]
We extend distributional reinforcement learning to the average-reward setting, where an agent aims to optimize the reward received per time-step. In particular, we utilize a quantile-based approach to develop the first set of algorithms that can successfully learn and/or optimize the long-run per-step reward distribution.
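As a hedged illustration of the quantile-based idea (a generic quantile-regression update, not code from the paper):

```python
import numpy as np

def quantile_update(theta, reward, taus, lr=0.05):
    """Nudge each quantile estimate theta[i] of the per-step reward
    distribution toward an observed reward, via the pinball-loss subgradient."""
    theta = np.asarray(theta, dtype=float)
    for i, tau in enumerate(taus):
        theta[i] += lr * (tau - (reward < theta[i]))
    return theta

taus = (np.arange(5) + 0.5) / 5   # five quantile levels
theta = np.zeros(5)               # quantile estimates of the per-step reward
for r in np.random.normal(1.0, 0.5, size=2000):
    theta = quantile_update(theta, r, taus)
```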
arXiv Detail & Related papers (2025-06-03T19:26:25Z) - Generative Diffusion Models for Resource Allocation in Wireless Networks [77.36145730415045]
We train a policy to imitate an expert and generate new samples from the optimal distribution. We achieve near-optimal performance through the sequential execution of the generated samples. We present numerical results in a case study of power control.
arXiv Detail & Related papers (2025-04-28T21:44:31Z) - Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.
Current approaches typically address this issue through online sampling from the target policy.
We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Direct Distributional Optimization for Provable Alignment of Diffusion Models [39.048284342436666]
We introduce a novel alignment method for diffusion models from a distribution optimization perspective.
We first formulate the problem as a generic regularized loss minimization over probability distributions.
We enable sampling from the learned distribution by approximating its score function via Doob's $h$-transform technique.
arXiv Detail & Related papers (2025-02-05T07:35:15Z) - Discrete Probabilistic Inference as Control in Multi-path Environments [84.67055173040107]
We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem.
We show that GFlowNets learn a policy that samples objects proportionally to their reward by enforcing a conservation of flows.
We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward.
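For intuition about the flow-conservation constraint, a toy numeric check on a four-edge DAG (hypothetical example, not the paper's code):

```python
# Toy DAG: s0 -> {a, b} -> t, with one learnable flow per edge.
F = {("s0", "a"): 1.0, ("s0", "b"): 1.0, ("a", "t"): 1.0, ("b", "t"): 1.0}
reward = {"t": 2.0}
terminals = {"t"}

def flow_matching_loss(F):
    """Squared violation of flow conservation at each non-source node:
    inflow must equal outflow at interior nodes, and the reward at terminals."""
    loss = 0.0
    for v in ("a", "b", "t"):
        inflow = sum(f for (u, w), f in F.items() if w == v)
        outflow = sum(f for (u, w), f in F.items() if u == v)
        target = reward[v] if v in terminals else outflow
        loss += (inflow - target) ** 2
    return loss

print(flow_matching_loss(F))  # 0.0: this flow is conserved and matches the reward
```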
arXiv Detail & Related papers (2024-02-15T20:20:35Z) - A Distributional Analogue to the Successor Representation [54.99439648059807]
This paper contributes a new approach for distributional reinforcement learning.
It elucidates a clean separation of transition structure and reward in the learning process.
As an illustration, we show that it enables zero-shot risk-sensitive policy evaluation.
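As a loose illustration of why that separation enables zero-shot risk-sensitive evaluation (a toy sketch; the Dirichlet occupancy samples below are a stand-in, not the paper's model):

```python
import numpy as np

def return_samples(occupancy_samples, rewards):
    """Each inner product of a sampled discounted state-occupancy (shape
    (n, |S|)) with a reward vector (shape (|S|,)) is one return sample."""
    return occupancy_samples @ rewards

occ = np.random.dirichlet(np.ones(4), size=1000) / (1 - 0.99)  # toy occupancy samples
r_new = np.array([0.0, 1.0, -1.0, 0.5])  # a reward function fixed only at test time
G = return_samples(occ, r_new)
risk = np.quantile(G, 0.05)  # risk-sensitive statistic, no further learning needed
```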
arXiv Detail & Related papers (2024-02-13T15:35:24Z) - Submodular Reinforcement Learning [38.40138241424851]
In reinforcement learning (RL), rewards of states are typically considered additive and, following the Markov assumption, $\textit{independent}$ of states visited previously.
In many important applications, such as coverage control, experiment design and informative path planning, rewards naturally have diminishing returns, i.e., their value decreases in light of similar states visited previously.
We propose $\textit{submodular RL}$ (SubRL), a paradigm which seeks to optimize more general, non-additive (and history-dependent) rewards modelled via submodular set functions, which capture diminishing returns.
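For concreteness, a toy coverage-style reward with diminishing returns (illustrative, not from the paper):

```python
def coverage_reward(covered, state, cover):
    """Marginal coverage gain of visiting `state`, given the set of items
    already covered along the trajectory; `cover` maps states to item sets."""
    return len(cover[state] - covered)

cover = {"s1": {1, 2}, "s2": {2, 3}, "s3": {3}}
covered = set()
for s in ["s1", "s2", "s3"]:
    r = coverage_reward(covered, s, cover)  # yields 2, then 1, then 0
    covered |= cover[s]
```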
arXiv Detail & Related papers (2023-07-25T09:46:02Z) - Robustness and risk management via distributional dynamic programming [13.173307471333619]
We introduce a new class of distributional operators, together with a practical DP algorithm for policy evaluation.
Our approach reformulates the problem through an augmented state space where each state is split into a worst-case substate and a best-case substate.
We derive distributional operators and DP algorithms solving a new control task.
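One speculative toy reading of the substate idea (every detail below is assumed for illustration rather than taken from the paper):

```python
import numpy as np

def substate_backup(V, succ, R, gamma=0.99):
    """One backup on an augmented space: V has shape (|S|, 2), with column 0
    the worst-case substate and column 1 the best-case substate; succ[s]
    lists the possible next states and R[s] is the reward at s."""
    V_new = np.empty_like(V)
    for s in range(len(succ)):
        V_new[s, 0] = R[s] + gamma * min(V[s2, 0] for s2 in succ[s])
        V_new[s, 1] = R[s] + gamma * max(V[s2, 1] for s2 in succ[s])
    return V_new
```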
arXiv Detail & Related papers (2021-12-28T12:12:57Z) - Distributional Reinforcement Learning for Multi-Dimensional Reward Functions [91.88969237680669]
We introduce Multi-Dimensional Distributional DQN (MD3QN) to model the joint return distribution from multiple reward sources.
As a by-product of joint distribution modeling, MD3QN can capture the randomness in returns for each source of reward.
In experiments, our method accurately models the joint return distribution in environments with richly correlated reward functions.
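A rough sketch of matching joint return samples across reward sources, assuming an MMD-style sample loss as one plausible choice (names and shapes are illustrative):

```python
import numpy as np

def mmd2(X, Y, bw=1.0):
    """Squared MMD with an RBF kernel between two sets of joint return
    samples, each of shape (n_samples, n_reward_sources)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bw ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

pred = np.random.randn(64, 3)                 # predicted joint return samples
target = 0.5 + 0.99 * np.random.randn(64, 3)  # Bellman-target samples: r + gamma * Z'
loss = mmd2(pred, target)  # driving this down aligns the joint distributions
```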
arXiv Detail & Related papers (2021-10-26T11:24:23Z) - Bayesian Distributional Policy Gradients [2.28438857884398]
Distributional Reinforcement Learning maintains the entire probability distribution of the reward-to-go, i.e. the return.
Bayesian Distributional Policy Gradients (BDPG) uses adversarial training in joint-contrastive learning to estimate a variational posterior from the returns.
arXiv Detail & Related papers (2021-03-20T23:42:50Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose an implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z) - Likelihood-Free Inference with Deep Gaussian Processes [70.74203794847344]
Surrogate models have been successfully used in likelihood-free inference to decrease the number of simulator evaluations.
We propose a Deep Gaussian Process (DGP) surrogate model that can handle more irregularly behaved target distributions.
Our experiments show how DGPs can outperform GPs on objective functions with multimodal distributions and maintain a comparable performance in unimodal cases.
arXiv Detail & Related papers (2020-06-18T14:24:05Z) - Sample-based Distributional Policy Gradient [14.498314462218394]
We propose sample-based distributional policy gradient (SDPG) algorithm for continuous action space control settings.
We apply SDPG and D4PG to multiple OpenAI Gym environments and observe that our algorithm shows better sample efficiency as well as higher reward for most tasks.
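For flavor, a heavily simplified sample-based distributional critic in the spirit described (all names hypothetical): a generator maps noise to return samples, trained by quantile regression against sorted target samples.

```python
import numpy as np

def generator(noise, w, b):
    """Toy return-sample generator: an affine map from noise to samples."""
    return w * noise + b

def quantile_huber(pred_sorted, target_sorted, kappa=1.0):
    """Quantile-Huber loss between sorted predicted and target return samples."""
    n = len(pred_sorted)
    taus = (np.arange(n) + 0.5) / n
    u = target_sorted - pred_sorted
    huber = np.where(np.abs(u) <= kappa, 0.5 * u**2, kappa * (np.abs(u) - 0.5 * kappa))
    return (np.abs(taus - (u < 0)) * huber).mean()

noise = np.random.randn(32)
pred = np.sort(generator(noise, w=1.0, b=0.0))
target = np.sort(0.1 + 0.99 * np.random.randn(32))  # r + gamma * next-return samples
loss = quantile_huber(pred, target)
```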
arXiv Detail & Related papers (2020-01-08T17:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.