Auto-exploration for online reinforcement learning
- URL: http://arxiv.org/abs/2512.06244v1
- Date: Sat, 06 Dec 2025 02:04:50 GMT
- Title: Auto-exploration for online reinforcement learning
- Authors: Caleb Ju, Guanghui Lan
- Abstract summary: The exploration-exploitation dilemma in reinforcement learning is a fundamental challenge to efficient RL algorithms. Existing algorithms for finite state and action discounted RL problems address this by assuming sufficient exploration over both state and action spaces. We introduce a new class of methods with auto-exploration, or methods that automatically explore both state and action spaces in a parameter-free way.
- Score: 1.2788899058467404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The exploration-exploitation dilemma in reinforcement learning (RL) is a fundamental challenge to efficient RL algorithms. Existing algorithms for finite state and action discounted RL problems address this by assuming sufficient exploration over both state and action spaces. However, this yields non-implementable algorithms and sub-optimal performance. To resolve these limitations, we introduce a new class of methods with auto-exploration, or methods that automatically explore both state and action spaces in a parameter-free way, i.e., without a priori knowledge of problem-dependent parameters. We present two variants: one for the tabular setting and one for linear function approximation. Under algorithm-independent assumptions on the existence of an exploring optimal policy, both methods attain $O(\epsilon^{-2})$ sample complexity to solve to $\epsilon$ error. Crucially, these complexities are novel since they are void of algorithm-dependent parameters seen in prior works, which may be arbitrarily large. The methods are also simple to implement because they are parameter-free and do not directly estimate the unknown parameters. These feats are achieved by new algorithmic innovations for RL, including a dynamic mixing time, a discounted state distribution for sampling, a simple robust gradient estimator, and a recent advantage gap function to certify convergence.
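As a concrete illustration of one innovation named above, a discounted state distribution can be sampled with a standard stopping trick: follow the policy but halt with probability $1-\gamma$ at every step, so the stopped state is distributed according to the discounted visitation measure. The Python sketch below is illustrative only, not the paper's algorithm; the `reset`, `step`, and `policy` callables are hypothetical placeholders.

```python
import random

def sample_discounted_state(reset, step, policy, gamma=0.99, rng=random):
    """Draw one state from the discounted visitation distribution
    d_pi(s) proportional to sum_t gamma^t Pr(s_t = s): follow the policy,
    but stop the rollout with probability (1 - gamma) at every step."""
    state = reset()                       # initial state s_0
    while rng.random() < gamma:           # continue with probability gamma
        state = step(state, policy(state))
    return state                          # state at the random stopping time

# Toy two-state chain: action 0 stays put, action 1 switches states.
if __name__ == "__main__":
    reset = lambda: 0
    step = lambda s, a: s if a == 0 else 1 - s
    policy = lambda s: 1                  # always switch
    samples = [sample_discounted_state(reset, step, policy, 0.9) for _ in range(1000)]
    print("fraction of samples in state 1:", sum(samples) / len(samples))
```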
Related papers
- Adaptive Resolving Methods for Reinforcement Learning with Function Approximations [4.168629519090361]
We develop a new algorithm to solve reinforcement learning problems with function approximation. Our algorithm is based on a linear programming (LP) reformulation, and it re-solves the LP as new data arrive. In comparison to the $O(1/\sqrt{N})$ worst-case guarantee established in the previous literature, our instance-dependent guarantee is tighter when the underlying instance is favorable.
arXiv Detail & Related papers (2025-05-17T14:59:15Z)
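As background for the LP reformulation mentioned above: for a fully known tabular MDP, the optimal value function solves a small linear program, which adaptive-resolving methods re-solve as data accumulate. The sketch below shows only the static LP under that assumption, using scipy; it is not the paper's algorithm, and the example MDP is invented.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, r, gamma):
    """Solve a known tabular MDP via its primal LP:
        min_v sum(v)  s.t.  v(s) >= r(s,a) + gamma * P[s,a] @ v  for all s,a.
    P has shape (S, A, S); r has shape (S, A)."""
    S, A = r.shape
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * P[s, a]   # gamma * P v ...
            row[s] -= 1.0           # ... minus v(s)
            A_ub[s * A + a] = row   # (gamma P - e_s) v <= -r(s,a)
            b_ub[s * A + a] = -r[s, a]
    res = linprog(c=np.ones(S), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x                    # optimal value function v*

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(solve_mdp_lp(P, r, gamma=0.9))  # optimal values v*(0), v*(1)
```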
- A Clean Slate for Offline Reinforcement Learning [30.87055102715522]
Offline reinforcement learning (RL) has been impeded by ambiguous problem definitions and entangled algorithmic designs. We introduce a rigorous taxonomy and a transparent evaluation protocol that explicitly quantifies online tuning budgets. We develop two novel algorithms, TD3-AWR (model-free) and MoBRAC (model-based), which substantially outperform established baselines.
arXiv Detail & Related papers (2025-04-15T17:59:05Z)
- Tractable Offline Learning of Regular Decision Processes [50.11277112628193]
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs). In RDPs, the unknown dependency of future observations and rewards on past interactions can be captured by a finite automaton. Many algorithms first reconstruct this unknown dependency using automata learning techniques.
arXiv Detail & Related papers (2024-09-04T14:26:58Z)
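To make the automaton view concrete: in an RDP, a finite-state machine summarizes exactly the part of the history that matters, making rewards and dynamics Markovian in the machine state. The sketch below is a generic, invented example (a machine that remembers whether a "key" observation has occurred), not taken from the paper.

```python
# A Regular Decision Process summarized by a finite automaton: the
# automaton state is a sufficient statistic of the whole history.
AUTOMATON = {
    # (machine_state, observation) -> next machine_state
    ("no_key", "key"): "has_key",
    ("no_key", "blank"): "no_key",
    ("has_key", "key"): "has_key",
    ("has_key", "blank"): "has_key",
}

def reward(machine_state, observation):
    # Reward depends on the history only through the machine state.
    return 1.0 if machine_state == "has_key" and observation == "door" else 0.0

def track(history):
    """Replay a history of observations through the automaton."""
    q = "no_key"
    for obs in history:
        q = AUTOMATON.get((q, obs), q)  # unlisted observations leave q unchanged
    return q

print(track(["blank", "key", "blank"]))  # -> has_key
```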
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX. MEX integrates estimation and planning components while balancing exploration and exploitation automatically. It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
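A schematic of the kind of single objective MEX optimizes: pick the hypothesis whose promised value, minus a data-fit loss, is largest, so optimism is traded off against evidence without separate exploration bonuses. The hypothesis set, loss, and coefficient below are illustrative placeholders, not the paper's instantiation.

```python
def mex_select(hypotheses, value_of, loss_on_data, eta=1.0):
    """Pick the hypothesis maximizing (promised value) - eta * (data-fit loss):
    one objective that is optimistic (favors high value) yet grounded in data
    (penalizes poor fit), balancing exploration and exploitation."""
    return max(hypotheses, key=lambda f: value_of(f) - eta * loss_on_data(f))

# Hypothetical usage: two candidate models of a one-step problem.
models = [
    {"name": "optimistic", "value": 10.0, "loss": 8.0},
    {"name": "well-fit",   "value": 6.0,  "loss": 0.5},
]
best = mex_select(models, value_of=lambda f: f["value"],
                  loss_on_data=lambda f: f["loss"])
print(best["name"])  # -> "well-fit" (its value-minus-loss score is higher)
```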
- Online Learning Under A Separable Stochastic Approximation Framework [20.26530917721778]
We propose an online learning algorithm for a class of machine learning models under a separable stochastic approximation framework. We show that the proposed algorithm produces more robust and better test performance when compared to other popular learning algorithms.
arXiv Detail & Related papers (2023-05-12T13:53:03Z)
- Stochastic Direct Search Method for Blind Resource Allocation [6.574808513848414]
We study direct search (also known as pattern search) methods for linearly constrained and derivative-free optimization. We show that direct search methods achieve finite regret in the deterministic and unconstrained case. We propose a simple extension of direct search that achieves a regret upper bound of the order of $T^{2/3}$.
arXiv Detail & Related papers (2022-10-11T07:40:45Z)
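For orientation, classical deterministic pattern search (the starting point the paper extends to the stochastic setting) polls a fixed stencil of directions and shrinks the step when no poll point improves. A minimal sketch, not the paper's stochastic variant:

```python
def direct_search(f, x, step=1.0, shrink=0.5, tol=1e-6):
    """Derivative-free pattern search: poll +/- step along each coordinate;
    move to any improving point, otherwise shrink the step size."""
    x = list(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                y = x[:]
                y[i] += delta
                if f(y) < f(x):         # accept any improving poll point
                    x, improved = y, True
        if not improved:
            step *= shrink              # no improvement: refine the mesh
    return x

# Hypothetical usage on a smooth quadratic; converges near [1, -2].
print(direct_search(lambda z: (z[0] - 1) ** 2 + (z[1] + 2) ** 2, [0.0, 0.0]))
```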
- Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency [111.83670279016599]
We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces. We make the first attempt at partial observability and function approximation for a class of POMDPs with a linear structure.
arXiv Detail & Related papers (2022-04-20T21:15:38Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm. For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\operatorname{polylog}(K)$ times. In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z)
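One standard way to realize such polylog-switching behavior, shown here as an assumption-laden sketch rather than the paper's exact criterion, is the determinant-doubling trick from low-switching linear RL: re-solve for a new policy only when the feature information matrix grows by a constant factor.

```python
import numpy as np

def low_switching_updates(features, threshold=2.0):
    """Trigger a policy update only when the collected information, measured
    by the determinant of the feature covariance, grows by a constant factor.
    With d-dimensional features and K rounds, this fires O(d log K) times."""
    d = features.shape[1]
    cov = np.eye(d)                      # regularized information matrix
    last_logdet = 0.0
    update_rounds = []
    for k, phi in enumerate(features):
        cov += np.outer(phi, phi)        # accumulate information
        logdet = np.linalg.slogdet(cov)[1]
        if logdet - last_logdet > np.log(threshold):
            update_rounds.append(k)      # information doubled: update policy
            last_logdet = logdet
    return update_rounds

rng = np.random.default_rng(0)
feats = rng.normal(size=(10_000, 4))     # hypothetical feature stream
print(len(low_switching_updates(feats))) # a few dozen updates over 10k rounds
```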
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based zeroth-order (ZO) algorithm, ZO-RL, which learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling. Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
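For context, the random-sampling baseline that ZO-RL replaces is the classical two-point zeroth-order gradient estimator, which probes the objective along random Gaussian directions. A minimal sketch (not from the paper):

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, n_samples=200, rng=np.random.default_rng(0)):
    """Two-point zeroth-order gradient estimate: probe f along random Gaussian
    directions u and average (f(x + mu*u) - f(x)) / mu * u. ZO-RL's idea is
    to learn the distribution of u instead of sampling it at random."""
    grad = np.zeros_like(x)
    fx = f(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        grad += (f(x + mu * u) - fx) / mu * u
    return grad / n_samples

x = np.array([1.0, -2.0])
# Noisy estimate of the true gradient [2, -4] of the squared norm.
print(zo_gradient(lambda z: (z ** 2).sum(), x))
```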
- Byzantine-Resilient Non-Convex Stochastic Gradient Descent [61.6382287971982]
We study adversary-resilient distributed optimization, in which machines can independently compute gradients and cooperate to jointly optimize over their local data. Our algorithm is based on a new concentration technique, and we analyze its sample complexity. It is very practical: it improves upon the performance of all prior methods when no Byzantine machines are present.
arXiv Detail & Related papers (2020-12-28T17:19:32Z)
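To illustrate the general recipe behind Byzantine-resilient SGD, the sketch below uses coordinate-wise median aggregation, a simpler stand-in for the paper's concentration-based filtering; all names and data are invented for the example.

```python
import numpy as np

def coordinate_median_aggregate(gradients):
    """Robust aggregation for distributed SGD: take the coordinate-wise
    median of the workers' gradients, so a minority of Byzantine workers
    cannot drag the update arbitrarily far."""
    return np.median(np.stack(gradients), axis=0)

honest = [np.array([1.0, 1.0]) + 0.1 * np.random.default_rng(i).standard_normal(2)
          for i in range(5)]
byzantine = [np.array([1e6, -1e6])]      # one adversarial gradient
print(coordinate_median_aggregate(honest + byzantine))  # stays near [1, 1]
```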
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.