Q-Star Meets Scalable Posterior Sampling: Bridging Theory and Practice via HyperAgent
- URL: http://arxiv.org/abs/2402.10228v5
- Date: Fri, 14 Jun 2024 04:51:07 GMT
- Title: Q-Star Meets Scalable Posterior Sampling: Bridging Theory and Practice via HyperAgent
- Authors: Yingru Li, Jiawei Xu, Lei Han, Zhi-Quan Luo
- Abstract summary: HyperAgent is a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration.
We demonstrate that HyperAgent offers robust performance in large-scale deep RL benchmarks.
It can solve Deep Sea hard exploration problems with episodes that optimally scale with problem size and exhibits significant efficiency gains in the Atari suite.
- Score: 23.669599662214686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose HyperAgent, a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration in RL. HyperAgent allows for the efficient incremental approximation of posteriors associated with an optimal action-value function ($Q^\star$) without the need for conjugacy and follows the greedy policies w.r.t. these approximate posterior samples. We demonstrate that HyperAgent offers robust performance in large-scale deep RL benchmarks. It can solve Deep Sea hard exploration problems with episodes that optimally scale with problem size and exhibits significant efficiency gains in the Atari suite. Implementing HyperAgent requires minimal code addition to well-established deep RL frameworks like DQN. We theoretically prove that, under tabular assumptions, HyperAgent achieves logarithmic per-step computational complexity while attaining sublinear regret, matching the best known randomized tabular RL algorithm.
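The core mechanism described in the abstract, approximate posterior samples of $Q^\star$ drawn through a hypermodel and acted on greedily, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the additive index structure, the tabular setting, and all names and shapes are assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, index_dim = 8, 4, 5

# Hypothetical additive hypermodel: a random index z maps to one Q-table,
# Q(s, a | z) = mu[s, a] + sigma[s, a, :] @ z.
mu = np.zeros((n_states, n_actions))                       # mean parameters
sigma = 0.1 * rng.standard_normal((n_states, n_actions, index_dim))

def sample_q(z):
    """One approximate posterior sample of the Q-table for index z."""
    return mu + sigma @ z

def act(state, z):
    """Greedy action w.r.t. the sampled Q-function (posterior sampling)."""
    return int(np.argmax(sample_q(z)[state]))

# One index is drawn and held fixed for a whole episode, committing the
# agent to a single posterior sample, which is what yields deep exploration.
z = rng.standard_normal(index_dim)
action = act(0, z)
```

Training would update `mu` and `sigma` incrementally from transitions; the key point illustrated here is that sampling a fresh `z` is all it takes to get a new approximate posterior sample, with no conjugacy requirement.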
Related papers
- ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning [42.33815055388433]
ARLBench is a benchmark for hyperparameter optimization (HPO) in reinforcement learning (RL)
It allows comparisons of diverse HPO approaches while being highly efficient in evaluation.
ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL.
arXiv Detail & Related papers (2024-09-27T15:22:28Z) - Adaptive Foundation Models for Online Decisions: HyperAgent with Fast Incremental Uncertainty Estimation [20.45450465931698]
GPT-HyperAgent is an augmentation of GPT with HyperAgent for uncertainty-aware, scalable exploration in contextual bandits.
We prove that HyperAgent achieves fast incremental uncertainty estimation with $\tilde{O}(\log T)$ per-step computational complexity.
Our analysis demonstrates that HyperAgent's regret order matches that of exact Thompson sampling in linear contextual bandits.
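Incremental uncertainty estimation in a linear contextual bandit can be illustrated with a rank-one (Sherman-Morrison) update of the inverse covariance, which avoids re-inverting from scratch at every step. This is a generic linear-Gaussian sketch, not the paper's exact update rule; all names and the noise model are illustrative assumptions.

```python
import numpy as np

d = 3
A_inv = np.eye(d)        # inverse regularized covariance (uncertainty proxy)
b = np.zeros(d)
theta_hat = np.zeros(d)  # running least-squares estimate of the reward model

def update(x, r):
    """Incremental update after observing context x and reward r.
    Sherman-Morrison keeps this at O(d^2) per step, versus O(d^3)
    for recomputing the matrix inverse from scratch."""
    global A_inv, b, theta_hat
    Ax = A_inv @ x
    A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
    b += r * x
    theta_hat = A_inv @ b

def sample_theta(rng):
    """Thompson-sampling style draw from the Gaussian posterior proxy."""
    return rng.multivariate_normal(theta_hat, A_inv)

rng = np.random.default_rng(0)
true_theta = np.array([1.0, 0.0, -1.0])   # illustrative ground truth
for _ in range(200):
    x = rng.standard_normal(d)
    update(x, x @ true_theta + 0.1 * rng.standard_normal())
```

As data accumulates, `A_inv` shrinks and posterior draws concentrate around `theta_hat`, which is the behavior a regret analysis of this kind relies on.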
arXiv Detail & Related papers (2024-07-18T06:16:09Z) - The Effective Horizon Explains Deep RL Performance in Stochastic Environments [21.148001945560075]
Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds.
We introduce a new RL algorithm, SQIRL, which iteratively learns a near-optimal policy by exploring randomly to collect rollouts.
We leverage SQIRL to derive instance-dependent sample complexity bounds for RL that are exponential only in an "effective horizon" look-ahead and otherwise depend only on the complexity of the class used for approximation.
arXiv Detail & Related papers (2023-12-13T18:58:56Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive reinforcement learning (RL).
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay among exploration, exploitation, and representation learning in CVaR RL.
We prove a sample complexity bound for learning an $\epsilon$-optimal CVaR, stated in terms of $H$, the length of each episode, $A$, the size of the action space, and $d$, the dimension of the representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Learning RL-Policies for Joint Beamforming Without Exploration: A Batch
Constrained Off-Policy Approach [1.0080317855851213]
We consider the problem of network parameter optimization for joint beamforming in wireless networks.
We show that a policy can be trained for real-world deployment using only previously collected data, without any online exploration.
arXiv Detail & Related papers (2023-10-12T18:36:36Z) - Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
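The Langevin Monte Carlo idea behind this approach can be sketched in a few lines: each update takes a gradient step on the log-posterior and adds Gaussian noise, so the iterates themselves act as approximate posterior samples. The toy Gaussian target below is an assumption made here for illustration; in the paper's setting the parameter would define a Q function.

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_step(theta, grad_log_post, step=0.1):
    """One Langevin Monte Carlo update: half a gradient step on the
    log-posterior plus sqrt(step)-scaled Gaussian noise, so the chain
    approximately samples the posterior rather than finding its mode."""
    noise = rng.standard_normal(theta.shape)
    return theta + 0.5 * step * grad_log_post(theta) + np.sqrt(step) * noise

# Toy target: an independent Gaussian posterior N(mu, I), for which
# grad log p(theta) = mu - theta. Purely illustrative.
mu = np.array([1.0, -2.0])
theta = np.zeros(2)
samples = []
for _ in range(20000):
    theta = langevin_step(theta, lambda t: mu - t)
    samples.append(theta)
```

The appeal for deep RL is that this only needs gradients of an unnormalized log-posterior, the same quantity a standard optimizer already computes, with noise injected on top.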
arXiv Detail & Related papers (2023-05-29T17:11:28Z) - Bridging RL Theory and Practice with the Effective Horizon [18.706109961534676]
We show that prior bounds do not correlate well with when deep RL succeeds vs. fails.
We generalize this into a new complexity measure of an MDP that we call the effective horizon.
We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy.
arXiv Detail & Related papers (2023-04-19T17:59:01Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
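Trajectory-preference feedback of this kind is commonly formalized with a Bradley-Terry style link. The sketch below illustrates that common assumption; the logistic link and the use of cumulative reward as the latent score are modeling choices made here, not necessarily the paper's exact setup.

```python
import numpy as np

def preference_prob(ret_a, ret_b):
    """Bradley-Terry style preference model, a common PbRL assumption:
    the probability that the overseer prefers trajectory A over B,
    as a logistic function of their (latent) cumulative rewards."""
    return 1.0 / (1.0 + np.exp(-(ret_a - ret_b)))
```

Equal returns give probability 0.5, and the preference sharpens as the return gap grows; an algorithm can then fit a reward model by maximizing the likelihood of the observed comparisons.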
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that the sample-complexity separation between reward-free and reward-aware RL known from tabular MDPs does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Maximum Entropy RL (Provably) Solves Some Robust RL Problems [94.80212602202518]
We prove theoretically that standard maximum entropy RL is robust to some disturbances in the dynamics and the reward function.
Our results suggest that MaxEnt RL by itself is robust to certain disturbances, without requiring any additional modifications.
arXiv Detail & Related papers (2021-03-10T18:45:48Z)
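The maximum-entropy objective underlying these robustness results replaces the hard max in the Bellman backup with a soft, log-sum-exp value. A minimal numerically stable sketch, with the temperature parameter `alpha` as the only assumption:

```python
import numpy as np

def soft_value(q_values, alpha=1.0):
    """MaxEnt RL soft state value: V(s) = alpha * log sum_a exp(Q(s, a) / alpha),
    computed stably by shifting with the max before exponentiating.
    As alpha -> 0 this recovers the hard max of standard RL."""
    q = np.asarray(q_values, dtype=float) / alpha
    m = q.max()
    return alpha * (m + np.log(np.exp(q - m).sum()))
```

The corresponding Boltzmann policy, proportional to exp(Q(s, a) / alpha), keeps probability on all actions, which is the property the paper's robustness argument builds on.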
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.