Provably Efficient Exploration in Reward Machines with Low Regret
- URL: http://arxiv.org/abs/2412.19194v1
- Date: Thu, 26 Dec 2024 12:25:04 GMT
- Title: Provably Efficient Exploration in Reward Machines with Low Regret
- Authors: Hippolyte Bourel, Anders Jonsson, Odalric-Ambrym Maillard, Chenxiao Ma, Mohammad Sadegh Talebi
- Abstract summary: We study reinforcement learning for decision processes with non-Markovian reward. Our main algorithmic contribution is a model-based RL algorithm for decision processes involving probabilistic reward machines. We derive high-probability and non-asymptotic bounds on its regret and demonstrate the gain in terms of regret over existing algorithms.
- Score: 20.076030507802553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge of the task in the form of reward machines is available to the learner. We consider probabilistic reward machines with initially unknown dynamics, and investigate RL under the average-reward criterion, where the learning performance is assessed through the notion of regret. Our main algorithmic contribution is a model-based RL algorithm for decision processes involving probabilistic reward machines that is capable of exploiting the structure induced by such machines. We further derive high-probability and non-asymptotic bounds on its regret and demonstrate the gain in terms of regret over existing algorithms that could be applied, but obliviously to the structure. We also present a regret lower bound for the studied setting. To the best of our knowledge, the proposed algorithm constitutes the first attempt to tailor and analyze regret specifically for RL with probabilistic reward machines.
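To make the setting concrete, below is a minimal Python sketch of a probabilistic reward machine: a finite automaton over high-level event labels whose transitions stochastically determine both the next machine state and the emitted reward. The class, the label names, and the two-state example are illustrative assumptions, not the paper's notation or implementation.

```python
import random

class ProbabilisticRewardMachine:
    """Finite-state machine over high-level event labels.

    Reading a label in machine state u draws a successor state and a
    reward from a fixed distribution delta(u, label), represented here
    as a list of (probability, next_state, reward) triples.
    """

    def __init__(self, delta, initial_state=0):
        self.delta = delta
        self.initial_state = initial_state
        self.u = initial_state

    def reset(self):
        self.u = self.initial_state

    def step(self, label):
        """Advance the machine on one label and return the emitted reward."""
        outcomes = self.delta[(self.u, label)]
        weights = [p for p, _, _ in outcomes]
        _, self.u, reward = random.choices(outcomes, weights=weights)[0]
        return reward

# Hypothetical two-state "reach the goal" machine: the label "at_goal"
# in state 0 advances to state 1 with reward 1 w.p. 0.9 and fails
# (no reward, stay in state 0) w.p. 0.1; state 1 is absorbing.
delta = {
    (0, "at_goal"): [(0.9, 1, 1.0), (0.1, 0, 0.0)],
    (0, "none"):    [(1.0, 0, 0.0)],
    (1, "at_goal"): [(1.0, 1, 0.0)],
    (1, "none"):    [(1.0, 1, 0.0)],
}
prm = ProbabilisticRewardMachine(delta)
print(prm.step("at_goal"))  # 1.0 with probability 0.9, else 0.0
```

A learner that tracks both the environment state and the machine state u is effectively acting in their product, and it is this induced structure that a structure-aware algorithm can exploit for tighter regret than a structure-oblivious baseline.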
Related papers
- Reinforcement Learning with Stochastic Reward Machines [5.345748208068876]
We introduce a novel type of reward machines, called stochastic reward machines, and an algorithm for learning them. Our algorithm, based on constraint solving, learns minimal stochastic reward machines from the explorations of a reinforcement learning agent.
arXiv Detail & Related papers (2025-10-16T16:12:04Z) - On the Importance of Reward Design in Reinforcement Learning-based Dynamic Algorithm Configuration: A Case Study on OneMax with $(1+(\lambda,\lambda))$-GA [7.924445204088514]
We propose the application of a reward shaping mechanism to facilitate enhanced exploration of the environment by the RL agent.
Our work not only demonstrates the ability of RL to dynamically configure the $(1+(\lambda,\lambda))$-GA, but also confirms the advantages of reward shaping for the scalability of RL agents.
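The paper's exact shaping term is not reproduced in this summary; as a generic illustration of the mechanism, the sketch below uses classic potential-based shaping (Ng et al., 1999), which adds $\gamma\Phi(s') - \Phi(s)$ to the raw reward and provably leaves the optimal policy unchanged. The OneMax-style potential here is a hypothetical choice.

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based reward shaping.

    Adding gamma * phi(s_next) - phi(s) to the raw reward r shifts the
    return of every policy by the same amount, so the optimal policy is
    preserved while the agent receives denser feedback during learning.
    """
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential for OneMax: fraction of bits already set to 1.
phi = lambda bits: sum(bits) / len(bits)

print(shaped_reward(0.0, [0, 0, 1, 0], [0, 1, 1, 0], phi))  # 0.245
```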
arXiv Detail & Related papers (2025-02-27T16:53:28Z) - BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.
We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model.
We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - On Reward Structures of Markov Decision Processes [4.13365552362244]
A Markov decision process can be parameterized by a transition kernel and a reward function.
We study various kinds of "costs" associated with reinforcement learning, inspired by the demands of robotic applications.
We develop a novel estimator with an instance-specific error bound of $\tilde{O}(\sqrt{\tau_s/n})$ for estimating a single state value.
arXiv Detail & Related papers (2023-08-28T22:29:16Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward functions.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - Reward-Machine-Guided, Self-Paced Reinforcement Learning [30.42334205249944]
We develop a self-paced reinforcement learning algorithm guided by reward machines.
The proposed algorithm achieves optimal behavior reliably even in cases in which existing baselines cannot make any meaningful progress.
It also decreases the curriculum length and reduces the variance in the curriculum generation process by up to one-fourth and four orders of magnitude, respectively.
arXiv Detail & Related papers (2023-05-25T22:13:37Z) - Risk-Aware Linear Bandits: Theory and Applications in Smart Order Routing [10.69955834942979]
We consider risk-aware bandits optimization with applications in smart order routing (SOR).
Driven by the variance-minimizing globally-optimal (G-optimal) design, we propose the novel instance-independent Risk-Aware Explore-then-Commit (RISE) algorithm and the instance-dependent Risk-Aware Successive Elimination (RISE++) algorithm.
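As a rough illustration of the explore-then-commit template the summary names, the sketch below runs a risk-aware multi-armed bandit under an assumed mean-variance objective; the G-optimal exploratory design and the paper's exact schedules are omitted, and all names and parameters here are assumptions.

```python
import random
import statistics

def explore_then_commit(arms, n_explore, horizon, risk_weight=1.0):
    """Generic explore-then-commit for a risk-aware multi-armed bandit.

    Pull each arm n_explore times, score arms by an empirical
    mean-variance criterion (mean minus risk_weight times variance),
    then commit to the best-scoring arm for the remaining rounds.
    """
    samples = {i: [arm() for _ in range(n_explore)] for i, arm in enumerate(arms)}
    score = lambda i: statistics.mean(samples[i]) - risk_weight * statistics.variance(samples[i])
    best = max(samples, key=score)
    total = sum(x for xs in samples.values() for x in xs)
    total += sum(arms[best]() for _ in range(horizon - n_explore * len(arms)))
    return best, total

# Two hypothetical venues with equal mean fill quality but different risk:
arms = [lambda: random.gauss(1.0, 0.1), lambda: random.gauss(1.0, 0.5)]
print(explore_then_commit(arms, n_explore=50, horizon=1000))  # usually picks arm 0
```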
arXiv Detail & Related papers (2022-08-04T00:21:10Z) - Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback-efficiency and sample-efficiency of preference-based RL algorithms.
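One common way to realize such a bonus, sketched below under assumptions rather than as the paper's exact method, is to keep an ensemble of reward models fit to the preference feedback and treat their disagreement on a state-action pair as the novelty signal; the toy ensemble and the weight beta are illustrative.

```python
import numpy as np

def reward_with_bonus(reward_ensemble, state_action, beta=0.1):
    """Exploration bonus from disagreement among learned reward models.

    The ensemble's standard deviation on a state-action pair is high
    exactly where the learned reward is still uncertain, so adding it
    (scaled by beta) to the ensemble mean steers the agent toward
    regions where more preference feedback would be informative.
    """
    preds = np.array([r(state_action) for r in reward_ensemble])
    return preds.mean() + beta * preds.std()

# Three hypothetical reward models that disagree on an unfamiliar input.
ensemble = [lambda sa: 0.2, lambda sa: 0.9, lambda sa: 0.4]
print(reward_with_bonus(ensemble, state_action=None))  # mean 0.5 plus a small bonus
```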
arXiv Detail & Related papers (2022-05-24T23:22:10Z) - MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
arXiv Detail & Related papers (2021-07-15T08:19:57Z) - Heuristic-Guided Reinforcement Learning [31.056460162389783]
Tabula rasa RL algorithms require environment interactions or computation that scales with the horizon of the decision-making task.
Our framework can be viewed as a horizon-based regularization for controlling bias and variance in RL under a finite interaction budget.
In particular, we introduce the novel concept of an "improvable heuristic" -- a heuristic that allows an RL agent to extrapolate beyond its prior knowledge.
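The framework's precise reshaping is developed in the paper; in its simplest form, the idea can be sketched as bootstrapping a short lookahead with the heuristic as a terminal value, trading the bias of an imperfect heuristic against the cost and variance of reasoning over the full horizon. The function below is an illustrative reduction under that reading, not the paper's algorithm.

```python
def truncated_return(rewards, heuristic_terminal, gamma=0.99):
    """Bootstrap a short lookahead with a heuristic terminal value.

    Sums discounted rewards over a short horizon H = len(rewards) and
    substitutes the heuristic's value estimate for the unknown tail of
    the return beyond step H.
    """
    H = len(rewards)
    lookahead = sum(gamma**t * r for t, r in enumerate(rewards))
    return lookahead + gamma**H * heuristic_terminal

# Three observed steps, then a (hypothetical) heuristic estimate of 5.0
# for the value of the state reached.
print(truncated_return([0.0, 0.0, 1.0], heuristic_terminal=5.0))  # ~5.83
```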
arXiv Detail & Related papers (2021-06-05T00:04:09Z) - Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning [93.1435980666675]
We show how optimistic exploration can be easily combined with state-of-the-art reinforcement learning algorithms.
Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions.
arXiv Detail & Related papers (2020-06-15T18:37:38Z)