The Curse of Passive Data Collection in Batch Reinforcement Learning
- URL: http://arxiv.org/abs/2106.09973v3
- Date: Wed, 5 Jul 2023 14:19:02 GMT
- Title: The Curse of Passive Data Collection in Batch Reinforcement Learning
- Authors: Chenjun Xiao, Ilbin Lee, Bo Dai, Dale Schuurmans, Csaba Szepesvari
- Abstract summary: In high-stakes applications, active experimentation may be considered too risky, and thus data are often collected passively.
While in simple cases, such as in bandits, passive and active data collection are similarly effective, the price of passive sampling can be much higher when collecting data from a system with controlled states.
- Score: 82.6026077420886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In high-stakes applications, active experimentation may be
considered too risky, and thus data are often collected passively. While in
simple cases, such as in bandits, passive and active data collection are
similarly effective, the price of passive sampling can be much higher when
collecting data from a system with controlled states. The main focus of the
current paper is the characterization of this price. For example, when
learning in episodic finite state-action Markov decision processes (MDPs)
with $\mathrm{S}$ states and $\mathrm{A}$ actions, we show that even with the
best (but passively chosen) logging policy,
$\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H)}/\varepsilon^2)$ episodes are
necessary (and sufficient) to obtain an $\varepsilon$-optimal policy, where
$H$ is the length of episodes. This shows that the sample complexity blows up
exponentially compared to the case of active data collection, a result that
is not unexpected but, to the best of our knowledge, has not been published
before; the form of the exact expression is perhaps a little surprising. We
also extend these results in various directions, such as to other criteria
and to learning in the presence of function approximation, with similar
conclusions. A remarkable feature of our result is the sharp characterization
of the exponent that appears, which is critical for understanding what makes
passive learning hard.
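To make the exponential gap concrete, here is a minimal sketch in plain Python. The specific parameter values and the $\mathrm{poly}(\mathrm{S},\mathrm{A},H)/\varepsilon^2$ rate used for the active-collection baseline are illustrative assumptions, not figures from the paper; constants and log factors are dropped throughout.

```python
# Minimal sketch (not from the paper): compare the order of the passive
# lower bound Omega(A^{min(S-1, H)} / eps^2) with an assumed active-collection
# rate of order S * A * H / eps^2. Outputs illustrate scaling only,
# not exact episode counts.

def passive_episodes(S: int, A: int, H: int, eps: float) -> float:
    """Order of episodes needed under the best passive logging policy."""
    return A ** min(S - 1, H) / eps ** 2

def active_episodes(S: int, A: int, H: int, eps: float) -> float:
    """Assumed poly(S, A, H) / eps^2 order under active exploration."""
    return S * A * H / eps ** 2

if __name__ == "__main__":
    S, A, H, eps = 10, 4, 20, 0.1
    print(f"passive ~ {passive_episodes(S, A, H, eps):.2e} episodes")
    print(f"active  ~ {active_episodes(S, A, H, eps):.2e} episodes")
    # The exponent min(S-1, H) drives the blow-up: the requirement grows
    # exponentially in the smaller of the state count and the horizon.
```

Even at this small scale the passive requirement exceeds the active one by more than two orders of magnitude, and the gap widens exponentially as $\mathrm{S}$ or $H$ grow.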
Related papers
- Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
Learning conditional distributions is a central problem in machine learning.
We propose a new learning paradigm that integrates both paired and unpaired data.
Our approach also connects intriguingly with inverse entropic optimal transport (OT).
arXiv Detail & Related papers (2024-10-03T16:12:59Z) - Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge [0.704590071265998]
We study the sample complexity of online Q-learning methods when some prior knowledge about the dynamics is available or can be learned efficiently.
We present an optimistic Q-learning algorithm that achieves $\tilde{\mathcal{O}}(\mathrm{Poly}(H)\sqrt{SAT})$ regret under perfect knowledge of $f$ (a generic sketch of optimistic Q-learning appears after this list).
arXiv Detail & Related papers (2023-12-19T19:53:58Z) - Demonstration-Regularized RL [39.96273388393764]
Using expert demonstrations, we identify an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes.
We establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-10-26T10:54:47Z) - Multi-Task Imitation Learning for Linear Dynamical Systems [50.124394757116605]
We study representation learning for efficient imitation learning over linear systems.
We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left(\frac{k n_x}{H N_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}}\right)$.
arXiv Detail & Related papers (2022-12-01T00:14:35Z) - Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
We propose the second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
arXiv Detail & Related papers (2022-10-26T21:03:46Z) - Tractable Optimality in Episodic Latent MABs [75.17357040707347]
We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps.
Depending on the length of the episode, the learner may not be able to estimate accurately the latent context.
We design a procedure that provably learns a near-optimal policy with $O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions.
arXiv Detail & Related papers (2022-10-05T22:53:46Z) - Revealing Unobservables by Deep Learning: Generative Element Extraction
Networks (GEEN) [5.3028918247347585]
This paper proposes a novel method for estimating realizations of a latent variable $X^*$ in a random sample.
To the best of our knowledge, this paper is the first to provide such identification from observations.
arXiv Detail & Related papers (2022-10-04T01:09:05Z) - A Few Expert Queries Suffices for Sample-Efficient RL with Resets and
Linear Value Approximation [16.29514743112387]
We study sample-efficient Reinforcement Learning (RL) in settings where only the optimal value function is assumed to be linearly realizable.
We present a statistically and computationally efficient algorithm (Delphi) for blending exploration with expert queries.
Delphi requires $\tilde{\mathcal{O}}(d)$ expert queries and a $\texttt{poly}(d,|\mathcal{A}|,1/\varepsilon)$ amount of exploratory samples to provably recover an $\varepsilon$-suboptimal policy.
arXiv Detail & Related papers (2022-07-18T01:39:13Z) - Indirect Active Learning [7.84669346764821]
We study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point.
In many cases, while there is a benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.
arXiv Detail & Related papers (2022-06-03T08:37:35Z) - Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbol{\mu} \in \mathbb{R}^n$, where each component $\mu_i$ can be sampled via a noisy oracle.
We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-\delta$ returns an $\epsilon$-accurate estimate of $f(\boldsymbol{\mu})$.
arXiv Detail & Related papers (2022-03-18T18:50:52Z)
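The "Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge" entry above mentions an optimistic Q-learning algorithm. For reference, below is a minimal sketch of generic optimistic (UCB-bonus) tabular Q-learning for episodic MDPs. This is not the cited paper's algorithm; the environment interface `env.reset()`/`env.step(a)`, the bonus constant `C`, and the failure probability `delta` are illustrative assumptions.

```python
import math
from collections import defaultdict

def optimistic_q_learning(env, S, A, H, num_episodes, C=1.0, delta=0.05):
    """Sketch of UCB-style optimistic Q-learning for a tabular episodic MDP."""
    # Q[h][(s, a)] is initialized optimistically at H; N counts visits.
    Q = [defaultdict(lambda: float(H)) for _ in range(H)]
    N = [defaultdict(int) for _ in range(H)]
    iota = math.log(S * A * H * num_episodes / delta)  # log confidence term

    for _ in range(num_episodes):
        s = env.reset()
        for h in range(H):
            # Act greedily with respect to the optimistic Q estimates.
            a = max(range(A), key=lambda a_: Q[h][(s, a_)])
            s_next, r = env.step(a)  # assumed to return (next state, reward)
            N[h][(s, a)] += 1
            t = N[h][(s, a)]
            alpha = (H + 1) / (H + t)                 # step size from UCB-Q analyses
            bonus = C * math.sqrt(H ** 3 * iota / t)  # optimism bonus
            v_next = 0.0 if h == H - 1 else min(
                float(H), max(Q[h + 1][(s_next, a_)] for a_ in range(A)))
            Q[h][(s, a)] = ((1 - alpha) * Q[h][(s, a)]
                            + alpha * (r + v_next + bonus))
            s = s_next
    return Q  # greedy policy: pi_h(s) = argmax_a Q[h][(s, a)]
```

Active exploration of this kind needs only polynomially many episodes in $S$, $A$, and $H$; the main paper above shows that no passively chosen logging policy can match that rate.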
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.