An Analysis of Frame-skipping in Reinforcement Learning
- URL: http://arxiv.org/abs/2102.03718v1
- Date: Sun, 7 Feb 2021 04:59:09 GMT
- Title: An Analysis of Frame-skipping in Reinforcement Learning
- Authors: Shivaram Kalyanakrishnan, Siddharth Aravindan, Vishwajeet Bagdawat,
Varun Bhatt, Harshith Goka, Archit Gupta, Kalpesh Krishna, Vihari Piratla
- Abstract summary: On many Atari console games, reinforcement learning algorithms deliver substantially better policies when run with $d > 1$.
We focus on "action-repetition", the common restriction of this choice to $d$-length sequences of the same action.
We show that this loss may be offset by the gain brought to learning by a smaller task horizon.
- Score: 13.680685626360903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the practice of sequential decision making, agents are often designed to
sense state at regular intervals of $d$ time steps, $d > 1$, ignoring state
information in between sensing steps. While it is clear that this practice can
reduce sensing and compute costs, recent results indicate a further benefit. On
many Atari console games, reinforcement learning (RL) algorithms deliver
substantially better policies when run with $d > 1$ -- in fact with $d$ even as
high as $180$. In this paper, we investigate the role of the parameter $d$ in
RL; $d$ is called the "frame-skip" parameter, since states in the Atari domain
are images. For evaluating a fixed policy, we observe that under standard
conditions, frame-skipping does not affect asymptotic consistency. Depending on
other parameters, it can possibly even benefit learning. To use $d > 1$ in the
control setting, one must first specify which $d$-step open-loop action
sequences can be executed in between sensing steps. We focus on
"action-repetition", the common restriction of this choice to $d$-length
sequences of the same action. We define a task-dependent quantity called the
"price of inertia", in terms of which we upper-bound the loss incurred by
action-repetition. We show that this loss may be offset by the gain brought to
learning by a smaller task horizon. Our analysis is supported by experiments on
different tasks and learning algorithms.
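To make the frame-skip/action-repetition setup concrete, below is a minimal sketch of one common way $d$-step action repetition is realized in practice: the agent selects an action once every $d$ time steps, the same action is replayed on the intervening steps, and the rewards collected in between are summed. This is illustrative code written against a Gymnasium-style `reset`/`step` interface, not code from the paper; the class name `ActionRepeatWrapper` and the default $d=4$ are assumptions made here for the example.

```python
import gymnasium as gym


class ActionRepeatWrapper(gym.Wrapper):
    """Repeat each chosen action for d consecutive time steps.

    The agent senses state only every d steps; rewards earned on the
    skipped steps are summed, so an episode of T primitive steps exposes
    roughly T/d decision points to the learner.
    """

    def __init__(self, env: gym.Env, d: int = 4):
        super().__init__(env)
        assert d >= 1, "frame-skip parameter d must be at least 1"
        self.d = d

    def step(self, action):
        total_reward = 0.0
        terminated = truncated = False
        obs, info = None, {}
        for _ in range(self.d):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:
                break  # stop repeating once the episode ends
        return obs, total_reward, terminated, truncated, info


# Usage: any RL algorithm now interacts with the shorter-horizon task.
env = ActionRepeatWrapper(gym.make("CartPole-v1"), d=4)
```

One way to read the "smaller task horizon" mentioned above: with frame-skip $d$, an episode of $T$ primitive time steps presents only about $T/d$ decisions to the learner, at the cost of restricting behaviour to $d$-length repetitions of the same action.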
Related papers
- Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit feedback).
Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory.
We introduce the first Policy Optimization algorithms for this setting.
arXiv Detail & Related papers (2025-02-06T12:03:24Z)
- Swarm Behavior Cloning [4.9854403800887415]
In sequential decision-making environments, the primary approaches for training agents are Reinforcement Learning (RL) and Imitation Learning (IL).
This paper addresses the issue of increasing action differences -- the observation that discrepancies between the $N$ predicted actions grow in states that are underrepresented in the training data.
We propose a method that fosters greater alignment among the policies while preserving the diversity of their computations.
arXiv Detail & Related papers (2024-12-10T15:54:57Z)
- Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2|\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z)
- Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space [0.0]
We revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure.
In our main result, we show that the regret of a slightly-tweaked version of the classical learning algorithm UCRL2 is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{EAT})$, where $E$ is related to the weighted second moment of the stationary measure of a reference policy.
arXiv Detail & Related papers (2023-02-21T13:28:37Z)
- Near-Optimal Adversarial Reinforcement Learning with Switching Costs [43.895798638743784]
We show how to develop a provably efficient algorithm for adversarial RL with switching costs.
Our lower bound indicates that, due to the fundamental challenge posed by switching costs in adversarial RL, the previously best achievable regret is no longer attainable.
We propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known.
arXiv Detail & Related papers (2023-02-08T23:41:29Z)
- The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL).
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z)
- Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost [31.04961854943877]
We propose a new algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$.
As a byproduct, our new algorithmic techniques allow us to derive a reward-free exploration algorithm with an optimal switching cost of $O(HSA)$.
arXiv Detail & Related papers (2022-02-13T19:01:06Z)
- Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z)
- Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}(1/\epsilon \cdot (H^3SA/\rho + H^4S^2A))$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho \geq \Theta(1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z)
- Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state with minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S||A|K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z)
- Revisiting Smoothed Online Learning [70.09792747315323]
We investigate the problem of smoothed online learning, in which the online learner suffers both a hitting cost and a switching cost.
To bound the competitive ratio, we assume the hitting cost is known to the learner in each round, and investigate the greedy algorithm which simply minimizes the weighted sum of the hitting cost and the switching cost.
arXiv Detail & Related papers (2021-02-13T14:15:55Z)
- Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only batch assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all of its content) and is not responsible for any consequences of its use.