Nonlinear Multi-objective Reinforcement Learning with Provable
Guarantees
- URL: http://arxiv.org/abs/2311.02544v2
- Date: Thu, 14 Dec 2023 20:33:01 GMT
- Title: Nonlinear Multi-objective Reinforcement Learning with Provable
Guarantees
- Authors: Nianli Peng and Brandon Fain
- Abstract summary: RA-E3 is an algorithm with provable guarantees for solving a single or multi-objective Markov Decision Process (MDP).
RA-E3 extends the classic E3 algorithm that solves MDPs with scalar rewards and linear preferences.
- Score: 1.6160052850279416
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We describe RA-E3 (Reward-Aware Explicit Explore or Exploit), an algorithm
with provable guarantees for solving a single or multi-objective Markov
Decision Process (MDP) where we want to maximize the expected value of a
nonlinear function over accumulated rewards. This allows us to model
fairness-aware welfare optimization for multi-objective reinforcement learning
as well as risk-aware reinforcement learning with nonlinear Von
Neumann-Morgenstern utility functions in the single objective setting. RA-E3
extends the classic E3 algorithm that solves MDPs with scalar rewards and
linear preferences. We first state a distinct reward-aware version of value
iteration that calculates a non-stationary policy that is approximately optimal
for a given model of the environment. This sub-procedure is based on an
extended form of Bellman optimality for nonlinear optimization that explicitly
considers time and current accumulated reward. We then describe how to use this
optimization procedure in a larger algorithm that must simultaneously learn a
model of the environment. The algorithm learns an approximately optimal policy
in time that depends polynomially on the MDP size, desired approximation, and
smoothness of the nonlinear function, and exponentially on the number of
objectives.
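
To make the reward-aware value iteration sub-procedure concrete, the sketch below illustrates one way to run backward induction over a state space augmented with the time step and a discretized accumulated-reward vector, applying the nonlinear function only to the final accumulated reward. This is a minimal illustrative reconstruction under stated assumptions, not the paper's actual implementation: the uniform discretization grid, deterministic per-step vector rewards, the welfare function `W`, and all identifiers are hypothetical choices made here for concreteness.

```python
import numpy as np
from itertools import product

def reward_aware_value_iteration(P, R, H, W, grid):
    """Backward induction over (state, time, discretized accumulated reward).

    P:    transition probabilities, shape (S, A, S)
    R:    deterministic per-step vector rewards, shape (S, A, d)  (an assumption)
    H:    horizon
    W:    nonlinear utility/welfare applied to the final accumulated reward vector
    grid: 1-D array of discretized accumulated-reward levels, shared across objectives
    Returns a non-stationary policy pi[t][(s, acc_idx)] -> action and stage-0 values.
    """
    S, A, d = R.shape
    # Enumerate discretized accumulated-reward vectors; this is exponential in d,
    # mirroring the exponential dependence on the number of objectives in the abstract.
    acc_indices = list(product(range(len(grid)), repeat=d))

    def snap(vec):
        # Project an accumulated-reward vector back onto the discretization grid.
        return tuple(int(np.abs(grid - v).argmin()) for v in vec)

    # Values at the end of the horizon: utility of whatever has been accumulated.
    V = {(s, idx): W(grid[list(idx)]) for s in range(S) for idx in acc_indices}
    pi = [dict() for _ in range(H)]

    for t in reversed(range(H)):            # backward induction over time
        V_new = {}
        for s in range(S):
            for idx in acc_indices:
                acc = grid[list(idx)]       # accumulated reward represented by idx
                q = np.empty(A)
                for a in range(A):
                    nxt = snap(acc + R[s, a])   # add this step's reward, re-discretize
                    q[a] = P[s, a] @ np.array([V[(s2, nxt)] for s2 in range(S)])
                best = int(q.argmax())
                pi[t][(s, idx)] = best
                V_new[(s, idx)] = q[best]
        V = V_new
    return pi, V

# Hypothetical usage: two states, two actions, two objectives, Nash-welfare-style W.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # (S, A, S), rows sum to 1
R = rng.uniform(0, 1, size=(2, 2, 2))        # (S, A, d)
grid = np.linspace(0.0, 5.0, 11)
pi, V0 = reward_aware_value_iteration(P, R, H=5, W=lambda x: np.prod(x + 1e-3), grid=grid)
```

A full RA-E3-style learner would wrap such a planner in an explicit explore-or-exploit outer loop that plans on the estimated model where it is well known and otherwise drives the agent toward under-explored states; that outer loop and the model-estimation step are omitted from this sketch.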
Related papers
- Spectral-factorized Positive-definite Curvature Learning for NN Training [39.296923519945814]
Training methods such as Adam(W) and Shampoo learn a positive-definite curvature matrix and apply an inverse root before preconditioning.
We propose a Riemannian optimization approach that dynamically adapts spectral-factorized positive-definite curvature estimates.
arXiv Detail & Related papers (2025-02-10T09:07:04Z)
- Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning [53.97335841137496]
We propose an oracle-efficient algorithm, dubbed Pessimistic Nonlinear Least-Squares Value Iteration (PNLSVI), for offline RL with non-linear function approximation.
Our algorithm enjoys a regret bound that has a tight dependency on the function class complexity and achieves minimax optimal instance-dependent regret when specialized to linear function approximation.
arXiv Detail & Related papers (2023-10-02T17:42:01Z)
- Inference on Optimal Dynamic Policies via Softmax Approximation [27.396891119011215]
We show that a simple soft-max approximation to the optimal treatment regime can achieve valid inference on the truly optimal regime.
Our work combines techniques from semi-parametric inference and $g$-estimation, together with an appropriate array central limit theorem.
arXiv Detail & Related papers (2023-03-08T07:42:47Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions are placed on the parameters through the use of plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
- Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs [113.8752163061151]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs).
We propose the Periodically Restarted Optimistic Policy Optimization algorithm (PROPO).
PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement.
arXiv Detail & Related papers (2021-10-18T02:33:20Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Sparse Bayesian Learning via Stepwise Regression [1.2691047660244335]
We propose a coordinate ascent algorithm for sparse Bayesian learning (SBL), termed Relevance Matching Pursuit (RMP).
As its noise variance parameter goes to zero, RMP exhibits a surprising connection to Stepwise Regression.
We derive novel guarantees for Stepwise Regression algorithms, which also shed light on RMP.
arXiv Detail & Related papers (2021-06-11T00:20:27Z)
- Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(\mathrm{poly}(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms prior heuristics inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z)
- Stochastic Learning Approach to Binary Optimization for Optimal Design of Experiments [0.0]
We present a novel approach to binary optimization for optimal experimental design (OED) for Bayesian inverse problems governed by mathematical models such as partial differential equations.
The OED utility function, namely, the regularized optimality gradient, is cast into an objective function in the form of an expectation over a Bernoulli distribution.
The objective is then solved by using a probabilistic optimization routine to find an optimal observational policy.
arXiv Detail & Related papers (2021-01-15T03:54:12Z)
- A maximum-entropy approach to off-policy evaluation in average-reward MDPs [54.967872716145656]
This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs).
We provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases.
We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning.
arXiv Detail & Related papers (2020-06-17T18:13:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.