Online Episodic Convex Reinforcement Learning
- URL: http://arxiv.org/abs/2505.07303v1
- Date: Mon, 12 May 2025 07:47:49 GMT
- Title: Online Episodic Convex Reinforcement Learning
- Authors: Bianca Marin Moreno, Khaled Eldowa, Pierre Gaillard, Margaux Brégère, Nadia Oudjane
- Abstract summary: We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge of the transition function. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent's policy.
- Score: 8.650735248242727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent's policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge of the transition function. To achieve this, we use an online mirror descent algorithm with varying constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent's policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.
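To make the core update concrete, here is a minimal mirror-descent sketch on a small finite-horizon MDP: per-stage policies are updated with an entropic (multiplicative-weights) step, and the induced state-action occupancy measure is recomputed each episode. This toy assumes a known, fixed transition kernel and uses a random stand-in gradient; the paper's actual algorithm instead handles unknown dynamics via varying constraint sets and an exploration bonus, and all names and dimensions below are illustrative.

```python
import numpy as np

# Toy entropic mirror-descent step on per-stage policies of a small
# finite-horizon MDP. Illustrative only: the transition kernel P is known
# and fixed, and `grad` is a random stand-in for the gradient of the
# convex objective F evaluated at the occupancy measure mu.

S, A, H, eta = 4, 3, 5, 0.1                    # states, actions, horizon, step size
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] = next-state distribution

pi = np.full((H, S, A), 1.0 / A)               # uniform initial policies

def occupancy(pi):
    """State-action distribution mu[h, s, a] induced by the policy pi."""
    mu = np.zeros((H, S, A))
    d = np.zeros(S)
    d[0] = 1.0                                 # deterministic initial state
    for h in range(H):
        mu[h] = d[:, None] * pi[h]             # joint state-action distribution
        d = np.einsum('sa,sat->t', mu[h], P)   # push the state forward through P
    return mu

for episode in range(200):
    mu = occupancy(pi)
    grad = rng.normal(size=(H, S, A))          # stand-in for grad F(mu)
    pi = pi * np.exp(-eta * grad)              # multiplicative-weights (KL) step
    pi /= pi.sum(axis=-1, keepdims=True)       # renormalize each action simplex
```

The entropic geometry keeps each update a simple reweighting over actions; the paper's analysis instead performs the mirror step on occupancy measures over a sequence of estimated constraint sets.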
Related papers
- Minimax-optimal and Locally-adaptive Online Nonparametric Regression [10.138723409205497]
We study adversarial online nonparametric regression with general convex losses.
We propose a parameter-free learning algorithm that achieves minimax optimal rates.
We discuss how these notions could be extended to a boosting framework.
arXiv Detail & Related papers (2024-10-04T12:30:03Z) - Efficient Model-Based Concave Utility Reinforcement Learning through Greedy Mirror Descent [0.0]
The Concave Utility Reinforcement Learning (CURL) problem invalidates classical Bellman equations.
We introduce MD-CURL, a new algorithm for CURL in a finite-horizon Markov decision process.
We present Greedy MD-CURL, a new method adapting MD-CURL to an online, episode-based setting.
arXiv Detail & Related papers (2023-11-30T08:32:50Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Inverse Reinforcement Learning With Constraint Recovery [3.8073142980732992]
We propose a novel inverse reinforcement learning (IRL) algorithm for constrained Markov decision process (CMDP) problems.
We demonstrate the efficacy of our algorithm for the grid world environment.
arXiv Detail & Related papers (2023-05-14T11:49:37Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
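To illustrate the Fenchel-dual trick mentioned above: the squared mean satisfies $(E[X])^2 = \max_{\nu} (2\nu E[X] - \nu^2)$, so maintaining a scalar dual variable removes the need for two independent samples per gradient step. The sketch below is a hypothetical scalar demonstration of this identity, not the paper's OVAR algorithm.

```python
import numpy as np

# Fenchel-dual trick for estimating Var[X] = E[X^2] - (E[X])^2 from one
# sample per step: rewrite (E[X])^2 = max_nu (2 nu E[X] - nu^2) and run
# stochastic gradient ascent on the scalar dual variable nu.

rng = np.random.default_rng(0)
nu = 0.0                                  # dual variable; optimum is E[X]
m2 = 0.0                                  # running estimate of E[X^2]

for t in range(1, 5001):
    x = rng.normal(loc=1.0, scale=2.0)    # a single sample of X
    nu += (2.0 * x - 2.0 * nu) / (2 * t)  # ascent step on 2*nu*x - nu^2
    m2 += (x ** 2 - m2) / t               # running mean of x^2

print(nu, m2 - nu ** 2)  # nu -> E[X] = 1, variance estimate -> Var[X] = 4
```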
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z) - Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm [4.128216503196621]
We propose an on-policy, model-based safe deep RL algorithm in which we learn the transition dynamics of the environment in an online manner.
We show that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches.
arXiv Detail & Related papers (2022-10-14T06:53:02Z) - Proximal Point Imitation Learning [48.50107891696562]
We develop new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning.
We leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing.
We achieve convincing empirical performance for both linear and neural network function approximation.
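As background for the proximal-point method (PPM) referenced here, the sketch below runs the classical iteration $x_{k+1} = \arg\min_x f(x) + \frac{1}{2\lambda}\|x - x_k\|^2$ on a one-dimensional quadratic whose prox step has a closed form; it illustrates the optimization tool only, not the paper's imitation-learning algorithm.

```python
# Classical proximal-point iteration on a 1-D quadratic,
#   x_{k+1} = argmin_x f(x) + (1 / (2 * lam)) * (x - x_k)**2,
# with f(x) = 0.5 * (x - 3)^2, so the prox step has a closed form.
# Purely illustrative of PPM; lam and f are arbitrary choices.

lam = 1.0
x = 0.0
for k in range(25):
    # Setting the derivative (x - 3) + (x - x_k) / lam to zero gives:
    x = (3.0 * lam + x) / (lam + 1.0)
print(x)  # converges to the minimizer x* = 3
```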
arXiv Detail & Related papers (2022-09-22T12:40:21Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
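For reference, the elliptical bonus being approximated is $b(x) = \beta \sqrt{x^\top A^{-1} x}$, with $A$ a regularized design matrix grown by rank-one updates. The sketch below computes this standard bonus exactly (illustrative dimensions and constants); the paper's contribution is a cheaper approximation that avoids maintaining the exact inverse.

```python
import numpy as np

# Exact elliptical exploration bonus b(x) = beta * sqrt(x^T A^{-1} x),
# where A accumulates outer products of visited feature vectors. This is
# the quantity that anti-concentrated confidence bounds approximate;
# dimensions and constants here are illustrative.

d, lam, beta = 8, 1.0, 1.0
A = lam * np.eye(d)                     # regularized design matrix

def elliptical_bonus(x):
    return beta * np.sqrt(x @ np.linalg.solve(A, x))

rng = np.random.default_rng(0)
for t in range(100):
    x = rng.normal(size=d)              # feature vector of the visited pair
    b = elliptical_bonus(x)             # bonus shrinks in well-visited directions
    A += np.outer(x, x)                 # rank-one update after the visit
```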
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Primal-dual Learning for the Model-free Risk-constrained Linear Quadratic Regulator [0.8629912408966145]
Risk-aware control, though promising for tackling unexpected events, requires a known and exact dynamical model.
We propose a model-free framework to learn a risk-aware controller, with a focus on linear systems.
arXiv Detail & Related papers (2020-11-22T04:40:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.