Reinforcement Learning with Non-Exponential Discounting
- URL: http://arxiv.org/abs/2209.13413v1
- Date: Tue, 27 Sep 2022 14:13:16 GMT
- Title: Reinforcement Learning with Non-Exponential Discounting
- Authors: Matthias Schultheis, Constantin A. Rothkopf, Heinz Koeppl
- Abstract summary: We propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions.
Our approach opens the way for the analysis of human discounting in sequential decision-making tasks.
- Score: 28.092095671829508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commonly in reinforcement learning (RL), rewards are discounted over time
using an exponential function to model time preference, thereby bounding the
expected long-term reward. In contrast, in economics and psychology, it has
been shown that humans often adopt a hyperbolic discounting scheme, which is
optimal when a specific task termination time distribution is assumed. In this
work, we propose a theory for continuous-time model-based reinforcement
learning generalized to arbitrary discount functions. This formulation covers
the case in which there is a non-exponential random termination time. We derive
a Hamilton-Jacobi-Bellman (HJB) equation characterizing the optimal policy and
describe how it can be solved using a collocation method, which uses deep
learning for function approximation. Further, we show how the inverse RL
problem can be approached, in which one tries to recover properties of the
discount function given decision data. We validate the applicability of our
proposed approach on two simulated problems. Our approach opens the way for the
analysis of human discounting in sequential decision-making tasks.
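The abstract notes that hyperbolic discounting is optimal under a specific random termination time. A standard argument from the hazard-rate literature (stated here for context; the paper's own setting may differ in detail) is that an exponential prior over an unknown constant termination rate yields a hyperbolic survival function:

```latex
% Survival probability when the task ends at an exponential time with unknown
% hazard rate \lambda, and \lambda has an exponential prior with mean k:
S(t) \;=\; \mathbb{E}_{\lambda}\!\left[e^{-\lambda t}\right]
     \;=\; \int_{0}^{\infty} e^{-\lambda t}\,\tfrac{1}{k}\,e^{-\lambda/k}\,\mathrm{d}\lambda
     \;=\; \frac{1}{1+kt}.
```

Discounting future reward by the probability that the task is still running then gives the hyperbolic factor $1/(1+kt)$ in place of the usual exponential $e^{-\rho t}$.

The abstract also states that the resulting HJB equation is solved by a collocation method with deep learning as the function approximator. The sketch below illustrates that general recipe only (physics-informed-style collocation on a toy problem); the dynamics $\dot{x}=u$, the quadratic reward, the hyperbolic discount, the exact form of the HJB residual, and the omission of boundary conditions are all assumptions, not the paper's formulation.

```python
# Minimal collocation sketch (illustrative assumptions throughout):
# approximate a time-dependent value function V(x, t) with a neural network
# and minimize the squared residual of an assumed HJB equation at randomly
# sampled collocation points.
import torch

torch.manual_seed(0)
k = 1.0  # hyperbolic discount parameter in d(t) = 1 / (1 + k * t)

value_net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)


def hjb_residual(x, t):
    """Residual of an assumed non-stationary HJB,
        V_t + max_u [ d(t) * r(x, u) + u * V_x ] = 0,
    for the toy problem dx/dt = u, r(x, u) = -(x^2 + u^2), d(t) = 1/(1 + k*t).
    The inner maximum has the closed form u* = V_x / (2 * d(t))."""
    v = value_net(torch.stack([x, t], dim=-1)).squeeze(-1)
    v_x, v_t = torch.autograd.grad(v.sum(), (x, t), create_graph=True)
    disc = 1.0 / (1.0 + k * t)
    u_star = v_x / (2.0 * disc)
    hamiltonian = disc * (-(x ** 2) - u_star ** 2) + u_star * v_x
    return v_t + hamiltonian


for step in range(2000):
    # Collocation points sampled uniformly in a bounded state-time region.
    x = (4.0 * torch.rand(256) - 2.0).requires_grad_(True)
    t = (5.0 * torch.rand(256)).requires_grad_(True)
    loss = hjb_residual(x, t).pow(2).mean()  # terminal condition omitted here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice one would add a terminal or boundary condition term to the loss and validate the learned value function on simulated problems, as the paper does; this snippet only shows how collocation points, automatic differentiation, and the residual loss fit together.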
Related papers
- Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis [16.288866201806382]
We develop a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference.
The algorithm identifies the optimal policy directly from human preference information in a backward manner.
arXiv Detail & Related papers (2024-06-11T17:01:41Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important for solving sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning [39.07307690074323]
We consider the problem of predicting the distribution of returns obtained by an agent interacting in a continuous-time environment.
Accurate return predictions have proven useful for determining optimal policies for risk-sensitive control, state representations, multiagent coordination, and more.
We propose a tractable algorithm for approximately solving the distributional HJB based on a JKO scheme, which can be implemented in an online control algorithm.
arXiv Detail & Related papers (2022-05-24T16:33:54Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefits of large learning rates can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
(A generic worked computation of this spectral effect is sketched after this entry.)
arXiv Detail & Related papers (2022-02-28T13:01:04Z)
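As a generic worked computation (not taken from the paper above) behind the spectral claim in the preceding entry: gradient descent with step size $\eta$ on a quadratic objective, started from zero, reaches after $n$ iterations

```latex
% f(w) = \tfrac{1}{2}\langle w, \Sigma w \rangle - \langle b, w \rangle,
% updates w_{j+1} = w_j - \eta (\Sigma w_j - b), w_0 = 0,
% eigenpairs (\lambda_i, v_i) of \Sigma:
w_n \;=\; \sum_i \frac{1 - (1 - \eta \lambda_i)^n}{\lambda_i}\,
          \langle b, v_i \rangle\, v_i .
```

Early stopping at iteration $n$ therefore applies the spectral filter $\varphi_{n,\eta}(\lambda) = \bigl(1-(1-\eta\lambda)^n\bigr)/\lambda$ to each eigendirection, and this filter depends jointly on $n$ and $\eta$: roughly, only components with $\lambda \gtrsim 1/(\eta n)$ are fitted, so a larger learning rate recovers more of the small-eigenvalue part of the solution before stopping.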
- Exponential Family Model-Based Reinforcement Learning via Score Matching [97.31477125728844]
We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL).
SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression.
(The generic score-matching objective behind this step is sketched after this entry.)
arXiv Detail & Related papers (2021-12-28T15:51:07Z)
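For context on the score-matching step in the preceding entry (this is the generic Hyvärinen-style objective, not necessarily the paper's exact estimator): the model is fitted by matching the gradient of its log-density with respect to the data, which makes the intractable normalizing constant drop out, and for an exponential family the objective becomes quadratic in the natural parameters.

```latex
% Score-matching objective (integration-by-parts form); the partition function
% cancels because \nabla_x \log p_\theta(x) does not depend on it:
J(\theta) \;=\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[
  \tfrac{1}{2}\,\bigl\|\nabla_x \log p_\theta(x)\bigr\|^2
  + \operatorname{tr}\!\bigl(\nabla_x^2 \log p_\theta(x)\bigr)
\right].
% For p_\theta(x) \propto h(x)\exp\{\theta^\top \phi(x)\}, the score
% \nabla_x \log p_\theta(x) = \nabla_x \phi(x)^\top \theta + \nabla_x \log h(x)
% is affine in \theta, so J(\theta) is quadratic in \theta and its
% \ell_2-regularized minimizer is a ridge-regression solution.
```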
- Exploration-exploitation trade-off for continuous-time episodic reinforcement learning with linear-convex models [2.503869683354711]
We study finite-time horizon control problems with linear dynamics but unknown coefficients, and a convex but possibly irregular objective function.
We identify conditions under which this performance gap is quadratic, improving the linear performance gap in recent work.
Next, we propose a phase-based learning algorithm for which we show how to optimise exploration-exploitation trade-off and achieve sublinear regrets.
arXiv Detail & Related papers (2021-12-19T21:47:04Z)
- A Generalised Inverse Reinforcement Learning Framework [24.316047317028147]
The aim of inverse Reinforcement Learning (IRL) is to estimate the unknown cost function of some MDP based on observed trajectories.
We introduce an alternative training loss that puts more weight on future states, which yields a reformulation of the (maximum entropy) IRL problem.
The algorithms we devised exhibit better performance (and similar tractability) than off-the-shelf ones in multiple OpenAI Gym environments.
arXiv Detail & Related papers (2021-05-25T10:30:45Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
- Nested-Wasserstein Self-Imitation Learning for Sequence Generation [158.19606942252284]
We propose the concept of nested-Wasserstein distance for distributional semantic matching.
A novel nested-Wasserstein self-imitation learning framework is developed, encouraging the model to exploit historical high-rewarded sequences.
arXiv Detail & Related papers (2020-01-20T02:19:13Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.