Highway Reinforcement Learning
- URL: http://arxiv.org/abs/2405.18289v1
- Date: Tue, 28 May 2024 15:42:45 GMT
- Title: Highway Reinforcement Learning
- Authors: Yuhui Wang, Miroslav Strupl, Francesco Faccio, Qingyuan Wu, Haozhe Liu, Michał Grudzień, Xiaoyang Tan, Jürgen Schmidhuber,
- Abstract summary: Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL)
We introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF.
It gives rise to a novel family of off-policy RL algorithms that safely learn even when $n$ is very large.
- Score: 35.980387097763035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize off-policy data directly without any additional adjustment. They work well for proper choices of $n$. We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large $n$, restricting their capacity to efficiently utilize information from distant future time steps. To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF. At its core lies a simple but non-trivial \emph{highway gate}, which controls the information flow from the distant future by comparing it to a threshold. The highway gate guarantees convergence to the optimal VF for arbitrary $n$ and arbitrary behavioral policies. It gives rise to a novel family of off-policy RL algorithms that safely learn even when $n$ is very large, facilitating rapid credit assignment from the far future to the past. On tasks with greatly delayed rewards, including video games where the reward is given only at the end of the game, our new methods outperform many existing multi-step off-policy algorithms.
Related papers
- An Empirical Study of Lagrangian Methods in Safe Reinforcement Learning [0.9802224811027896]
In safety-critical domains, constrained optimization problems arise where maximizing performance must be carefully balanced with associated constraints.<n>Lagrangian methods are a popular choice to address these challenges.<n>We show that the effectiveness of Lagrangian methods crucially depends on the choice of the Lagrange multiplier $lambda$.
arXiv Detail & Related papers (2025-10-20T14:13:17Z) - Deep Reinforcement Learning with Gradient Eligibility Traces [28.93284550303061]
In this paper, we extend the generalized $overlinetextPBE$ objective to support multistep credit assignment based on the $lambda$-return.<n>We provide both a forward-view formulation compatible with experience replay and a backward-view formulation compatible with streaming algorithms.<n>We evaluate the proposed algorithms and show that they outperform both PPO and StreamQ in MuJoCo and MinAtar environments.
arXiv Detail & Related papers (2025-07-12T00:12:05Z) - Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards [17.695285420477035]
We study the intermediate range of algorithms between off-policy RL and supervised fine-tuning.<n>We first provide a theoretical analysis of this off-policy REINFORCE algorithm.<n>Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones.
arXiv Detail & Related papers (2025-06-25T15:07:16Z) - Learning Diverse Policies with Soft Self-Generated Guidance [2.9602904918952695]
Reinforcement learning with sparse and deceptive rewards is challenging because non-zero rewards are rarely obtained.
This paper develops an approach that uses diverse past trajectories for faster and more efficient online RL.
arXiv Detail & Related papers (2024-02-07T02:53:50Z) - Offline Primal-Dual Reinforcement Learning for Linear MDPs [16.782625445546273]
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL.
arXiv Detail & Related papers (2023-05-22T11:45:23Z) - Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a $mathbf2.5times$ improvement over existing approaches.
arXiv Detail & Related papers (2023-02-06T17:30:22Z) - An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z) - Robust Predictable Control [149.71263296079388]
We show that our method achieves much tighter compression than prior methods, achieving up to 5x higher reward than a standard information bottleneck.
We also demonstrate that our method learns policies that are more robust and generalize better to new tasks.
arXiv Detail & Related papers (2021-09-07T17:29:34Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Online Sub-Sampling for Reinforcement Learning with General Function
Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with complexity-bounded function class, we show that the policy only needs to be updated for $proptooperatornamepolylog(K)$ times.
In contrast to existing approaches that update the policy for at least $Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z) - Improved Regret Bound and Experience Replay in Regularized Policy
Iteration [22.621710838468097]
We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation.
We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T3/4)$ to $O(sqrtT)$ under nearly identical assumptions.
Our result provides the first high-probability $O(sqrtT)$ regret bound for a computationally efficient algorithm in this setting.
arXiv Detail & Related papers (2021-02-25T00:55:07Z) - Greedy Multi-step Off-Policy Reinforcement Learning [14.720255341733413]
We propose a novel bootstrapping method, which greedily takes the maximum value among the bootstrapping values with varying steps.
Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-02-23T14:32:20Z) - Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO)
OPPO achieves $tildeO(sqrtd2 H3 T )$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.