On the Model-Misspecification in Reinforcement Learning
- URL: http://arxiv.org/abs/2306.10694v2
- Date: Sat, 6 Jan 2024 21:36:13 GMT
- Title: On the Model-Misspecification in Reinforcement Learning
- Authors: Yunfan Li and Lin Yang
- Abstract summary: We present a unified theoretical framework for addressing model misspecification in reinforcement learning.
We show that value-based and model-based methods can achieve robustness under local misspecification error bounds.
We also propose an algorithmic framework that can achieve the same order of regret bound without prior knowledge of $zeta$.
- Score: 9.864462523050843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of reinforcement learning (RL) crucially depends on effective
function approximation when dealing with complex ground-truth models. Existing
sample-efficient RL algorithms primarily employ three approaches to function
approximation: policy-based, value-based, and model-based methods. However, in
the face of model misspecification (a disparity between the ground-truth and
optimal function approximators), it is shown that policy-based approaches can
be robust even when the policy function approximation is under a large
locally-bounded misspecification error, with which the function class may
exhibit a $\Omega(1)$ approximation error in specific states and actions, but
remains small on average within a policy-induced state distribution. Yet it
remains an open question whether similar robustness can be achieved with
value-based and model-based approaches, especially with general function
approximation.
To bridge this gap, in this paper we present a unified theoretical framework
for addressing model misspecification in RL. We demonstrate that, through
meticulous algorithm design and sophisticated analysis, value-based and
model-based methods employing general function approximation can achieve
robustness under local misspecification error bounds. In particular, they can
attain a regret bound of $\widetilde{O}\left(\text{poly}(d H)(\sqrt{K} +
K\zeta) \right)$, where $d$ represents the complexity of the function class,
$H$ is the episode length, $K$ is the total number of episodes, and $\zeta$
denotes the local bound for misspecification error. Furthermore, we propose an
algorithmic framework that can achieve the same order of regret bound without
prior knowledge of $\zeta$, thereby enhancing its practical applicability.
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_π$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Target Network and Truncation Overcome The Deadly triad in $Q$-Learning [7.532013242448151]
We propose a stable design for $Q$-learning with linear function approximation using target network and truncation.
Our result implies an $mathcalO(epsilon-2)$ sample complexity up to a function approximation error.
This is the first variant of $Q$-learning with linear function approximation that is provably stable without requiring strong assumptions or modifying the problem parameters.
arXiv Detail & Related papers (2022-03-05T00:54:58Z) - Implicitly Regularized RL with Implicit Q-Values [42.87920755961722]
The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms for which RL agents behave following a (soft)-greedy policy.
We propose to parametrize the $Q$-function implicitly, as the sum of a log-policy and of a value function.
We derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the $Q$-value.
arXiv Detail & Related papers (2021-08-16T12:20:47Z) - Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(poly(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms priors inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z) - Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement
Learning? [30.065091907118827]
This work considers sample complexity of finding an $epsilon$-optimal policy in a Markov decision process (MDP)
We solve this problem via a plug-in solver approach, which builds an empirical model and plans in this empirical model via an arbitrary plug-in solver.
We show that a plug-in approach can be sample efficient as well, providing a flexible approach to design model-based algorithms for reinforcement learning.
arXiv Detail & Related papers (2020-10-12T13:13:01Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $tilde O(|S||B|(gamma)-3epsilon-2)$ for finding the Nash equilibrium (NE) value up to some $epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z) - Reinforcement Learning with General Value Function Approximation:
Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
We establish a provably efficient reinforcement learning algorithm with general value function approximation.
We show that our algorithm achieves a regret bound of $widetildeO(mathrmpoly(dH)sqrtT)$ where $d$ is a complexity measure.
Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment.
arXiv Detail & Related papers (2020-05-21T17:36:09Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.