The Curse of Passive Data Collection in Batch Reinforcement Learning
- URL: http://arxiv.org/abs/2106.09973v3
- Date: Wed, 5 Jul 2023 14:19:02 GMT
- Title: The Curse of Passive Data Collection in Batch Reinforcement Learning
- Authors: Chenjun Xiao, Ilbin Lee, Bo Dai, Dale Schuurmans, Csaba Szepesvari
- Abstract summary: In high-stakes applications, active experimentation may be considered too risky, and thus data are often collected passively.
While in simple cases, such as in bandits, passive and active data collection are similarly effective, the price of passive sampling can be much higher when collecting data from a system with controlled states.
- Score: 82.6026077420886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In high-stakes applications, active experimentation may be
considered too risky, and thus data are often collected passively. While in
simple cases, such as in bandits, passive and active data collection are
similarly effective, the price of passive sampling can be much higher when
collecting data from a system with controlled states. The main focus of the
current paper is the characterization of this price. For example, when
learning in episodic finite state-action Markov decision processes (MDPs)
with $\mathrm{S}$ states and $\mathrm{A}$ actions, we show that even with the
best (but passively chosen) logging policy,
$\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H)}/\varepsilon^2)$ episodes are
necessary (and sufficient) to obtain an $\varepsilon$-optimal policy, where
$H$ is the length of episodes. This shows that the sample complexity blows up
exponentially compared to the case of active data collection, a result that
is not unexpected but, to the best of our knowledge, has not been published
before; the form of the exact expression is perhaps a little surprising. We
also extend these results in various directions, such as to other criteria
and to learning in the presence of function approximation, with similar
conclusions. A remarkable feature of our result is the sharp characterization
of the exponent that appears, which is critical for understanding what makes
passive learning hard.
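To make the exponential gap concrete, here is a minimal sketch in plain Python. The specific parameter values and the $\mathrm{poly}(\mathrm{S},\mathrm{A},H)/\varepsilon^2$ rate used for the active-collection baseline are illustrative assumptions, not figures from the paper; constants and log factors are dropped throughout.

```python
# Minimal sketch (not from the paper): compare the order of the passive
# lower bound Omega(A^{min(S-1, H)} / eps^2) with an assumed active-collection
# rate of order S * A * H / eps^2. Outputs illustrate scaling only,
# not exact episode counts.

def passive_episodes(S: int, A: int, H: int, eps: float) -> float:
    """Order of episodes needed under the best passive logging policy."""
    return A ** min(S - 1, H) / eps ** 2

def active_episodes(S: int, A: int, H: int, eps: float) -> float:
    """Assumed poly(S, A, H) / eps^2 order under active exploration."""
    return S * A * H / eps ** 2

if __name__ == "__main__":
    S, A, H, eps = 10, 4, 20, 0.1
    print(f"passive ~ {passive_episodes(S, A, H, eps):.2e} episodes")
    print(f"active  ~ {active_episodes(S, A, H, eps):.2e} episodes")
    # The exponent min(S-1, H) drives the blow-up: the requirement grows
    # exponentially in the smaller of the state count and the horizon.
```

Even at this small scale the passive requirement exceeds the active one by more than two orders of magnitude, and the gap widens exponentially as $\mathrm{S}$ or $H$ grow.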
Related papers
- Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
Learning conditional distributions is a central problem in machine learning.
We propose a new learning paradigm that integrates both paired and unpaired data.
Our approach also connects intriguingly with inverse entropic optimal transport (OT).
arXiv Detail & Related papers (2024-10-03T16:12:59Z) - Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge [0.704590071265998]
We study the sample complexity of online Q-learning methods when some prior knowledge about the dynamics is available or can be learned efficiently.
We present an optimistic Q-learning algorithm that achieves $\tilde{\mathcal{O}}(\mathrm{Poly}(H)\sqrt{SAT})$ regret under perfect knowledge of $f$ (a generic sketch of optimistic Q-learning appears after this list).
arXiv Detail & Related papers (2023-12-19T19:53:58Z) - Demonstration-Regularized RL [39.96273388393764]
Using expert demonstrations, we identify an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes.
We establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-10-26T10:54:47Z) - Multi-Task Imitation Learning for Linear Dynamical Systems [50.124394757116605]
We study representation learning for efficient imitation learning over linear systems.
We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left(\frac{k n_x}{H N_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}}\right)$.
arXiv Detail & Related papers (2022-12-01T00:14:35Z) - Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
We propose the second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
arXiv Detail & Related papers (2022-10-26T21:03:46Z) - Tractable Optimality in Episodic Latent MABs [75.17357040707347]
We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps.
Depending on the length of the episode, the learner may not be able to estimate accurately the latent context.
We design a procedure that provably learns a near-optimal policy with $O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions.
arXiv Detail & Related papers (2022-10-05T22:53:46Z) - Revealing Unobservables by Deep Learning: Generative Element Extraction
Networks (GEEN) [5.3028918247347585]
This paper proposes a novel method for estimating realizations of a latent variable $X^*$ in a random sample.
To the best of our knowledge, this paper is the first to provide such identification from observations.
arXiv Detail & Related papers (2022-10-04T01:09:05Z) - A Few Expert Queries Suffices for Sample-Efficient RL with Resets and
Linear Value Approximation [16.29514743112387]
We study sample-efficient Reinforcement Learning (RL) in settings where only the optimal value function is assumed to be linearly realizable.
We present a statistically and computationally efficient algorithm (Delphi) for blending exploration with expert queries.
Delphi requires $\tilde{\mathcal{O}}(d)$ expert queries and a $\texttt{poly}(d,|\mathcal{A}|,1/\varepsilon)$ amount of exploratory samples to provably recover an $\varepsilon$-suboptimal policy.
arXiv Detail & Related papers (2022-07-18T01:39:13Z) - Indirect Active Learning [7.84669346764821]
We study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point.
In many cases, while there is a benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.
arXiv Detail & Related papers (2022-06-03T08:37:35Z) - Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbol{\mu} \in \mathbb{R}^n$, where each component $\mu_i$ can be sampled via a noisy oracle.
We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-\delta$ returns an $\epsilon$-accurate estimate of $f(\boldsymbol{\mu})$.
arXiv Detail & Related papers (2022-03-18T18:50:52Z)
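The "Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge" entry above mentions an optimistic Q-learning algorithm. For reference, below is a minimal sketch of generic optimistic (UCB-bonus) tabular Q-learning for episodic MDPs. This is not the cited paper's algorithm; the environment interface `env.reset()`/`env.step(a)`, the bonus constant `C`, and the failure probability `delta` are illustrative assumptions.

```python
import math
from collections import defaultdict

def optimistic_q_learning(env, S, A, H, num_episodes, C=1.0, delta=0.05):
    """Sketch of UCB-style optimistic Q-learning for a tabular episodic MDP."""
    # Q[h][(s, a)] is initialized optimistically at H; N counts visits.
    Q = [defaultdict(lambda: float(H)) for _ in range(H)]
    N = [defaultdict(int) for _ in range(H)]
    iota = math.log(S * A * H * num_episodes / delta)  # log confidence term

    for _ in range(num_episodes):
        s = env.reset()
        for h in range(H):
            # Act greedily with respect to the optimistic Q estimates.
            a = max(range(A), key=lambda a_: Q[h][(s, a_)])
            s_next, r = env.step(a)  # assumed to return (next state, reward)
            N[h][(s, a)] += 1
            t = N[h][(s, a)]
            alpha = (H + 1) / (H + t)                 # step size from UCB-Q analyses
            bonus = C * math.sqrt(H ** 3 * iota / t)  # optimism bonus
            v_next = 0.0 if h == H - 1 else min(
                float(H), max(Q[h + 1][(s_next, a_)] for a_ in range(A)))
            Q[h][(s, a)] = ((1 - alpha) * Q[h][(s, a)]
                            + alpha * (r + v_next + bonus))
            s = s_next
    return Q  # greedy policy: pi_h(s) = argmax_a Q[h][(s, a)]
```

Active exploration of this kind needs only polynomially many episodes in $S$, $A$, and $H$; the main paper above shows that no passively chosen logging policy can match that rate.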
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.