MetaCURL: Non-stationary Concave Utility Reinforcement Learning
- URL: http://arxiv.org/abs/2405.19807v1
- Date: Thu, 30 May 2024 08:17:00 GMT
- Title: MetaCURL: Non-stationary Concave Utility Reinforcement Learning
- Authors: Bianca Marin Moreno, Margaux Brégère, Pierre Gaillard, Nadia Oudjane
- Abstract summary: We explore online learning in episodic loop-free Markov decision processes in non-stationary environments.
This paper introduces MetaCURL, the first CURL algorithm for non-stationary MDPs.
- Score: 8.230945193151399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore online learning in episodic loop-free Markov decision processes in non-stationary environments (changing losses and probability transitions). Our focus is on the Concave Utility Reinforcement Learning problem (CURL), an extension of classical RL for handling convex performance criteria in state-action distributions induced by agent policies. While various machine learning problems can be written as CURL, its non-linearity invalidates traditional Bellman equations. Despite recent solutions to classical CURL, none address non-stationary MDPs. This paper introduces MetaCURL, the first CURL algorithm for non-stationary MDPs. It employs a meta-algorithm running multiple black-box algorithm instances over different intervals, aggregating outputs via a sleeping expert framework. The key hurdle is partial information due to MDP uncertainty. Under partial information on the probability transitions (uncertainty and non-stationarity coming only from external noise, independent of agent state-action pairs), we achieve optimal dynamic regret without prior knowledge of MDP changes. Unlike approaches for RL, MetaCURL handles full adversarial losses, not just stochastic ones. We believe our approach for managing non-stationarity with experts can be of interest to the RL community.
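The meta-algorithm described above aggregates black-box learners that are each "awake" only on their own interval. As a point of reference, the snippet below is a minimal, generic sketch of a sleeping-experts (specialists) aggregation step; the class name, loss scaling, and toy usage are illustrative assumptions, not MetaCURL itself.

```python
import numpy as np

class SleepingExpertsAggregator:
    """Exponentially weighted aggregation over experts that are only 'awake'
    on their own intervals (generic specialists update, not MetaCURL itself)."""

    def __init__(self, n_experts, learning_rate=0.1):
        self.eta = learning_rate
        self.log_w = np.zeros(n_experts)            # log-weights, uniform start

    def mixture(self, awake):
        """Normalized weights restricted to the awake experts (awake is 0/1)."""
        w = np.exp(self.log_w - self.log_w.max()) * awake
        return w / w.sum()

    def combine(self, predictions, awake):
        """Convex combination of the awake experts' predictions (rows)."""
        return self.mixture(awake) @ predictions

    def update(self, losses, awake):
        """Awake experts pay their own loss; sleeping experts are charged the
        aggregated loss, so their relative weight is unchanged on waking."""
        p = self.mixture(awake)
        master_loss = float(p @ losses)             # loss of the aggregate
        effective = np.where(awake > 0, losses, master_loss)
        self.log_w -= self.eta * effective

# toy usage: three experts, the third one asleep this round
agg = SleepingExpertsAggregator(n_experts=3)
awake = np.array([1.0, 1.0, 0.0])
preds = np.array([[0.2, 0.8], [0.6, 0.4], [0.5, 0.5]])  # e.g. action distributions
print(agg.combine(preds, awake))
agg.update(losses=np.array([0.3, 0.7, 0.0]), awake=awake)
```

In a MetaCURL-like setting, each expert would correspond to an instance of a black-box CURL algorithm started on a different interval, and the aggregate would define the policy actually played.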
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with unknown transitions and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d\sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z)
- Inverse Concave-Utility Reinforcement Learning is Inverse Game Theory [17.62475351325657]
We consider inverse reinforcement learning problems with concave utilities.
We show that most of the standard IRL results do not apply to CURL in general, since CURL invalidates the classical Bellman equations.
We propose a new definition for the feasible rewards for I-CURL by proving that this problem is equivalent to an inverse game theory problem in a subclass of mean-field games.
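For context on why the Bellman equations break, the CURL objective is usually written as a convex (or, when maximizing, concave) function of the occupancy measure rather than as an expected cumulative loss; the display below uses standard but assumed notation, not the exact formulation of either paper.

```latex
% CURL minimizes a convex function F of the state-action occupancy measure
% mu^pi induced by a policy pi over an episode of length H:
\min_{\pi}\; F\!\left(\mu^{\pi}\right),
\qquad
\mu^{\pi}(s,a) \;=\; \frac{1}{H}\,\mathbb{E}_{\pi}\!\Big[\sum_{h=1}^{H} \mathbf{1}\{s_h = s,\ a_h = a\}\Big].
% Classical RL is the linear case F(\mu) = \langle \ell, \mu \rangle; for a
% non-linear F the objective no longer decomposes additively over time steps,
% which is why the classical Bellman equations (and standard IRL arguments) fail.
```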
arXiv Detail & Related papers (2024-05-29T12:07:17Z)
- Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU problems are naturally modeled as Multistage Problems (MSPs), but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach, Two-Stage General Decision Rules (TS-GDR), to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks named Two-Stage Deep Decision Rules (TS-DDR).
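A rough sketch of the general decision-rule idea (not the architecture or training procedure of the paper): parameterize the second-stage (recourse) decision as a learned function of the first-stage decision and the realized uncertainty, and train both against sampled scenarios with a feasibility penalty. All dimensions, cost vectors, and the penalty weight below are toy assumptions.

```python
import torch
import torch.nn as nn

# Toy two-stage problem:  min_x  c^T x + E_xi[ q^T y ],  with y = rule(x, xi)
# and a linking constraint handled by a quadratic penalty (all values assumed).
torch.manual_seed(0)
n_x, n_xi, n_y = 4, 3, 5

rule = nn.Sequential(nn.Linear(n_x + n_xi, 32), nn.ReLU(), nn.Linear(32, n_y))
x = torch.zeros(n_x, requires_grad=True)           # first-stage decision
opt = torch.optim.Adam(list(rule.parameters()) + [x], lr=1e-2)
c, q = torch.ones(n_x), torch.ones(n_y)            # stage cost vectors

for step in range(200):
    xi = torch.randn(64, n_xi)                     # sampled uncertainty scenarios
    y = rule(torch.cat([x.expand(64, -1), xi], dim=1))
    # toy linking constraint:  sum(y) >= mean(xi) - mean(x)  per scenario
    violation = torch.relu(xi.mean(dim=1) - x.mean() - y.sum(dim=1))
    loss = c @ x + (y @ q).mean() + 10.0 * violation.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Here `rule` plays the role of the second-stage decision rule; per the summary above, the paper's instantiation uses deep recurrent networks.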
arXiv Detail & Related papers (2024-05-23T18:19:47Z)
- Efficient Model-Based Concave Utility Reinforcement Learning through Greedy Mirror Descent [0.0]
The Concave Utility Reinforcement Learning (CURL) problem invalidates the classical Bellman equations.
We introduce MD-CURL, a new algorithm for CURL in a finite horizon Markov decision process.
We present Greedy MD-CURL, a new method adapting MD-CURL to an online, episode-based setting.
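A minimal sketch of the mirror-descent mechanism such methods build on (an illustrative exponentiated-gradient loop with known transitions and a toy objective, not the paper's MD-CURL or its greedy online variant):

```python
import numpy as np

# Illustrative exponentiated-gradient (KL mirror descent) loop for a concave-
# utility objective in a small finite-horizon MDP with KNOWN transitions.
# The toy objective, sizes, and step size are assumptions, not MD-CURL itself.
rng = np.random.default_rng(0)
H, S, A = 4, 3, 2                                   # horizon, states, actions
P = rng.dirichlet(np.ones(S), size=(H, S, A))       # P[h, s, a, s'] transitions
mu0 = np.full(S, 1.0 / S)                           # initial state distribution
pi = np.full((H, S, A), 1.0 / A)                    # uniform initial policy
eta = 0.5

def occupancy(pi):
    """State-action occupancy measure mu[h, s, a] induced by pi."""
    mu, d = np.zeros((H, S, A)), mu0.copy()
    for h in range(H):
        mu[h] = d[:, None] * pi[h]
        d = np.einsum("sa,sat->t", mu[h], P[h])
    return mu

def grad_F(mu):
    """Gradient of a toy convex loss F(mu) = sum mu log mu (negative entropy)."""
    return np.log(np.clip(mu, 1e-12, None)) + 1.0

for it in range(50):
    loss = grad_F(occupancy(pi))                    # linearize F at current mu
    Q, V = np.zeros((H, S, A)), np.zeros(S)         # Q-values of the linear loss
    for h in reversed(range(H)):
        Q[h] = loss[h] + np.einsum("sat,t->sa", P[h], V)
        V = np.sum(pi[h] * Q[h], axis=1)
    # exponentiated-gradient policy update (shifted per state for stability)
    pi = pi * np.exp(-eta * (Q - Q.min(axis=2, keepdims=True)))
    pi /= pi.sum(axis=2, keepdims=True)
```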
arXiv Detail & Related papers (2023-11-30T08:32:50Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe to convert static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
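A rough sketch of the adaptive-discretization idea (not the paper's exact scheme): cluster the continuous actions that actually appear in the offline dataset, for instance with k-means, so the discrete action set is dense where the behavior policy puts mass. The bin count and the use of scikit-learn's KMeans are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_actions(dataset_actions: np.ndarray, n_bins: int = 32):
    """Cluster the dataset's continuous actions into n_bins discrete actions.

    Returns the codebook (one continuous action per discrete id), a function
    mapping a continuous action to its nearest discrete id, and the labels."""
    km = KMeans(n_clusters=n_bins, n_init=10, random_state=0)
    ids = km.fit_predict(dataset_actions)            # (N,) discrete labels
    codebook = km.cluster_centers_                   # (n_bins, action_dim)

    def to_discrete(a: np.ndarray) -> int:
        return int(np.argmin(np.linalg.norm(codebook - a, axis=1)))

    return codebook, to_discrete, ids

# toy usage on random 2-D actions; a discrete offline RL method would then
# operate on the `ids` instead of the raw continuous actions
actions = np.random.default_rng(0).uniform(-1, 1, size=(1000, 2))
codebook, to_discrete, ids = quantize_actions(actions, n_bins=16)
print(codebook.shape, to_discrete(np.array([0.1, -0.2])))
```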
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Bring Your Own (Non-Robust) Algorithm to Solve Robust MDPs by Estimating The Worst Kernel [46.373217780462944]
We present EWoK, a novel online approach for solving robust MDPs (RMDPs) that Estimates the Worst transition Kernel to learn robust policies.
EWoK achieves robustness by simulating the worst scenarios for the agent while retaining complete flexibility in the learning process.
Our experiments, spanning from simple Cartpole to high-dimensional DeepMind Control Suite environments, demonstrate the effectiveness and applicability of EWoK.
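A toy tabular sketch of the "estimate the worst kernel" idea (not the paper's EWoK): within an L1 ball around the nominal transition probabilities, shift probability mass toward low-value successor states, then let any standard non-robust RL algorithm learn against this pessimistic kernel. The ball radius and value estimates below are placeholders.

```python
import numpy as np

def worst_case_kernel(p_nominal, values, radius):
    """For one (s, a) pair: the transition distribution inside an L1 ball of
    the given radius around p_nominal that minimizes the expected next-state
    value.  Mass is moved from high-value states to the lowest-value state."""
    q = p_nominal.astype(float).copy()
    worst_state = int(np.argmin(values))
    budget = radius / 2.0                    # half the L1 budget is added mass
    for s in np.argsort(values)[::-1]:       # take from high-value states first
        if s == worst_state or budget <= 0:
            continue
        take = min(q[s], budget)
        q[s] -= take
        q[worst_state] += take
        budget -= take
    return q

# toy usage: pessimistic kernel for one state-action pair
p = np.array([0.5, 0.3, 0.2])
V = np.array([1.0, 0.0, 2.0])                # current value estimates (placeholder)
print(worst_case_kernel(p, V, radius=0.4))   # mass shifted toward state 1
```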
arXiv Detail & Related papers (2023-06-09T12:45:41Z)
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- Continual Test-Time Domain Adaptation [94.51284735268597]
Test-time domain adaptation aims to adapt a source pre-trained model to a target domain without using any source data.
CoTTA (Continual Test-Time Adaptation) is easy to implement and can be readily incorporated into off-the-shelf pre-trained models.
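The summary above does not spell out the mechanism; as commonly described, CoTTA combines a weight-averaged (EMA) teacher that provides pseudo-labels with stochastic restoration of source weights to limit drift. The PyTorch sketch below is a heavily simplified, assumed rendition (it omits augmentation-averaged pseudo-labels, among other details) and not the official implementation.

```python
import torch
import torch.nn.functional as F

def adapt_step(student, teacher, x, optimizer, ema=0.999, restore_p=0.01,
               source_state=None):
    """One continual test-time adaptation step, teacher-student style
    (simplified, assumed rendition; not the official CoTTA code)."""
    with torch.no_grad():
        pseudo = teacher(x).softmax(dim=1)              # teacher pseudo-labels
    loss = F.cross_entropy(student(x), pseudo)          # consistency to teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        # EMA update of the teacher toward the student
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1.0 - ema)
        # stochastic restore: occasionally reset weights to the source model
        if source_state is not None:
            for name, ps in student.named_parameters():
                mask = (torch.rand_like(ps) < restore_p).float()
                ps.copy_(mask * source_state[name] + (1.0 - mask) * ps)
    return loss.item()

# usage sketch (names assumed):
#   student, teacher = copy.deepcopy(source_model), copy.deepcopy(source_model)
#   source_state = {k: v.clone() for k, v in source_model.state_dict().items()}
#   opt = torch.optim.SGD(student.parameters(), lr=1e-3)
#   for x in target_stream:
#       adapt_step(student, teacher, x, opt, source_state=source_state)
```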
arXiv Detail & Related papers (2022-03-25T11:42:02Z)
- Semi-Markov Offline Reinforcement Learning for Healthcare [57.15307499843254]
We introduce three offline RL algorithms, namely, SDQN, SDDQN, and SBCQ.
We experimentally demonstrate that only these algorithms learn the optimal policy in variable-time environments.
We apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention.
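As a sketch of the semi-MDP ingredient these methods share (simplified and assumed, not SDQN/SDDQN/SBCQ themselves): when the time between decisions varies, the Q-learning target discounts by the actual elapsed duration rather than by a single step.

```python
import numpy as np

def semi_mdp_q_update(Q, s, a, r, tau, s_next, gamma=0.99, alpha=0.1):
    """One tabular Q-learning update for a semi-MDP transition, where `tau` is
    the (variable) time elapsed between decisions and `r` is the (ideally
    discounted) reward accumulated over that interval.  Illustrative sketch."""
    target = r + (gamma ** tau) * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# toy usage: 3 states, 2 actions, a transition that took tau = 5 time units
Q = np.zeros((3, 2))
Q = semi_mdp_q_update(Q, s=0, a=1, r=2.0, tau=5, s_next=2)
print(Q)
```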
arXiv Detail & Related papers (2022-03-17T14:51:21Z)
- Learning and Solving Regular Decision Processes [15.533842336139067]
Regular Decision Processes (RDPs) are a recently introduced model that extends MDPs with non-Markovian dynamics and rewards.
We build on automata learning techniques with history clustering to learn a Mealy machine representation of the RDP and solve it by adapting MCTS.
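For reference, a Mealy machine is a finite-state transducer whose output depends on the current state and input symbol; below is a minimal sketch of the data structure (illustrative only, not the learned RDP model from the paper).

```python
from dataclasses import dataclass, field

@dataclass
class MealyMachine:
    """Finite-state transducer: `delta` maps (state, symbol) -> next state,
    `lam` maps (state, symbol) -> output.  Illustrative sketch only."""
    initial: str
    delta: dict = field(default_factory=dict)   # (state, symbol) -> state
    lam: dict = field(default_factory=dict)     # (state, symbol) -> output

    def run(self, symbols):
        state, outputs = self.initial, []
        for x in symbols:
            outputs.append(self.lam[(state, x)])
            state = self.delta[(state, x)]
        return state, outputs

# toy machine: outputs the parity of 'a's seen so far (including the current one)
m = MealyMachine(initial="even",
                 delta={("even", "a"): "odd", ("odd", "a"): "even"},
                 lam={("even", "a"): "odd-count", ("odd", "a"): "even-count"})
print(m.run(["a", "a", "a"]))   # -> ('odd', ['odd-count', 'even-count', 'odd-count'])
```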
arXiv Detail & Related papers (2020-03-02T16:36:16Z)