Necessary and Sufficient Oracles: Toward a Computational Taxonomy For Reinforcement Learning
- URL: http://arxiv.org/abs/2502.08632v1
- Date: Wed, 12 Feb 2025 18:47:13 GMT
- Title: Necessary and Sufficient Oracles: Toward a Computational Taxonomy For Reinforcement Learning
- Authors: Dhruv Rohatgi, Dylan J. Foster
- Abstract summary: We study the impact of the choice of supervised learning oracle on the computational complexity of reinforcement learning algorithms.
First, we identify two-context regression as a minimal oracle in the standard episodic access model.
Second, we identify one-context regression as a near-minimal oracle in the stronger reset access model.
Third, we broaden our focus to Low-Rank MDPs, where we give cryptographic evidence that the analogous oracle from the Block MDP setting is insufficient.
- Score: 28.184175745050474
- Abstract: Algorithms for reinforcement learning (RL) in large state spaces crucially rely on supervised learning subroutines to estimate objects such as value functions or transition probabilities. Since only the simplest supervised learning problems can be solved provably and efficiently, practical performance of an RL algorithm depends on which of these supervised learning "oracles" it assumes access to (and how they are implemented). But which oracles are better or worse? Is there a minimal oracle? In this work, we clarify the impact of the choice of supervised learning oracle on the computational complexity of RL, as quantified by the oracle strength. First, for the task of reward-free exploration in Block MDPs in the standard episodic access model -- a ubiquitous setting for RL with function approximation -- we identify two-context regression as a minimal oracle, i.e. an oracle that is both necessary and sufficient (under a mild regularity assumption). Second, we identify one-context regression as a near-minimal oracle in the stronger reset access model, establishing a provable computational benefit of resets in the process. Third, we broaden our focus to Low-Rank MDPs, where we give cryptographic evidence that the analogous oracle from the Block MDP setting is insufficient.
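The abstract contrasts one-context and two-context regression oracles without spelling out their interfaces here. As a rough illustration only, the sketch below fixes the input/output shape such oracles might have, assuming a squared-loss formulation over a finite function class; the names and signatures are hypothetical, not the paper's formal definitions.

```python
# Hypothetical sketch of the two regression-oracle interfaces contrasted in the
# abstract. The squared-loss objective, the finite function class, and all names
# are illustrative assumptions, not definitions taken from the paper.
from typing import Callable, Sequence, Tuple

Context = Sequence[float]

def one_context_regression(
    data: Sequence[Tuple[Context, float]],
    function_class: Sequence[Callable[[Context], float]],
) -> Callable[[Context], float]:
    """Return a function-class element (approximately) minimizing squared loss
    for predicting the label y from a single context x."""
    return min(function_class,
               key=lambda f: sum((f(x) - y) ** 2 for x, y in data))

def two_context_regression(
    data: Sequence[Tuple[Context, Context, float]],
    function_class: Sequence[Callable[[Context, Context], float]],
) -> Callable[[Context, Context], float]:
    """Return a function-class element (approximately) minimizing squared loss
    for predicting the label y from a pair of contexts (x, x')."""
    return min(function_class,
               key=lambda g: sum((g(x1, x2) - y) ** 2 for x1, x2, y in data))
```

In these terms, the paper's question is which oracle an RL algorithm must assume: per the abstract, two-context regression is necessary and sufficient under episodic access, while the weaker one-context oracle becomes near-minimal once reset access is available.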
Related papers
- Fuzzy Logic Guided Reward Function Variation: An Oracle for Testing Reinforcement Learning Programs [2.871991859211386]
We propose an automated oracle approach that leverages RL properties using fuzzy logic.
Our oracle quantifies an agent's behavioral compliance with reward policies and analyzes its trend over training episodes.
It labels an RL program as "Buggy" if the compliance trend violates expectations derived from RL characteristics.
arXiv Detail & Related papers (2024-06-28T10:41:17Z) - Is Efficient PAC Learning Possible with an Oracle That Responds 'Yes' or 'No'? [26.334900941196082]
We investigate whether the ability to perform ERM, which computes a hypothesis minimizing empirical risk on a given dataset, is necessary for efficient learning.
We show that in the realizable setting of PAC learning for binary classification, a concept class can be learned using an oracle that returns only a single bit.
Our results extend to the learning setting with a slight strengthening of the oracle, as well as to the partial concept, multiclass and real-valued learning settings.
arXiv Detail & Related papers (2024-06-17T15:50:08Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation, with performance similar to or stronger than PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - The Power of Resets in Online Reinforcement Learning [73.64852266145387]
We explore the power of simulators through online reinforcement learning with local simulator access (or, local planning).
We show that MDPs with low coverability can be learned in a sample-efficient fashion with only $Q^\star$-realizability.
We show that the notorious Exogenous Block MDP problem is tractable under local simulator access.
arXiv Detail & Related papers (2024-04-23T18:09:53Z) - Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning [34.791182995710024]
We show the first cryptographic separation between reinforcement learning and supervised learning.
We also show that, under cryptographic assumptions, there is no computationally efficient algorithm for reward-directed RL in Block MDPs.
arXiv Detail & Related papers (2024-04-04T19:35:41Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems, improving future policies using observed feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Statistical curriculum learning: An elimination algorithm achieving an oracle risk [31.997825444285457]
We consider a statistical version of curriculum learning (CL) in a parametric prediction setting.
We consider three types of learners, depending on the level of side-information they receive.
arXiv Detail & Related papers (2024-02-20T20:44:40Z) - Online Iterative Reinforcement Learning from Human Feedback with General Preference Model [20.81421550138371]
We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle.
We consider a standard mathematical formulation: the reverse-KL regularized minimax game between two LLMs for RLHF under a general preference oracle.
We show that this framework is strictly more general than the reward-based one, and propose sample-efficient algorithms for both the offline learning from a pre-collected preference dataset and online learning.
arXiv Detail & Related papers (2024-02-11T21:44:21Z) - Oracle Inequalities for Model Selection in Offline Reinforcement Learning [105.74139523696284]
We study the problem of model selection in offline RL with value function approximation.
We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors.
We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
arXiv Detail & Related papers (2022-11-03T17:32:34Z) - Smoothed Online Learning is as Easy as Statistical Learning [77.00766067963195]
We provide the first oracle-efficient, no-regret algorithms in this setting.
We show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits.
arXiv Detail & Related papers (2022-02-09T19:22:34Z) - Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles [112.89548995091182]
We provide the first universal and optimal reduction from contextual bandits to online regression.
Our algorithm requires no distributional assumptions beyond realizability, and works even when contexts are chosen adversarially.
arXiv Detail & Related papers (2020-02-12T11:33:46Z)
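As a concrete illustration of the last entry above ("Beyond UCB"), reductions of this kind are often described via inverse-gap weighting: the regression oracle supplies reward predictions, and the bandit algorithm samples each non-greedy action with probability inversely proportional to its predicted gap to the best action. The sketch below shows that selection rule in isolation, with a tuning parameter `gamma`; it is a minimal sketch of the general idea, not a verbatim transcription of the paper's algorithm or constants.

```python
import numpy as np

def inverse_gap_weighting(predictions: np.ndarray, gamma: float) -> np.ndarray:
    """Map a regression oracle's reward predictions for K actions to a sampling
    distribution: actions with a larger predicted gap to the best action get
    less probability, and the leftover mass goes to the greedy action."""
    k = len(predictions)
    best = int(np.argmax(predictions))
    probs = np.zeros(k)
    for a in range(k):
        if a != best:
            probs[a] = 1.0 / (k + gamma * (predictions[best] - predictions[a]))
    probs[best] = 1.0 - probs.sum()  # remaining mass on the predicted-best action
    return probs

# Example: three actions with predicted rewards from the online regression oracle.
dist = inverse_gap_weighting(np.array([0.9, 0.2, 0.5]), gamma=10.0)
action = np.random.default_rng(0).choice(len(dist), p=dist)
```

Larger `gamma` concentrates mass on the greedy action; in the analysis of such reductions, `gamma` is typically tuned against the regression oracle's error.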