Time Discretization-Invariant Safe Action Repetition for Policy Gradient
Methods
- URL: http://arxiv.org/abs/2111.03941v1
- Date: Sat, 6 Nov 2021 19:17:24 GMT
- Title: Time Discretization-Invariant Safe Action Repetition for Policy Gradient
Methods
- Authors: Seohong Park, Jaekyeom Kim, Gunhee Kim
- Abstract summary: We propose a $\delta$-invariant algorithm for policy gradient (PG) methods.
We show that our method is not only $\delta$-invariant but also robust to stochasticity, outperforming previous $\delta$-invariant approaches.
- Score: 43.49494338665518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning, continuous time is often discretized by a time
scale $\delta$, to which the resulting performance is known to be highly
sensitive. In this work, we seek to find a $\delta$-invariant algorithm for
policy gradient (PG) methods, which performs well regardless of the value of
$\delta$. We first identify the underlying reasons that cause PG methods to
fail as $\delta \to 0$, proving that the variance of the PG estimator can
diverge to infinity in stochastic environments under a certain assumption of
stochasticity. While durative actions or action repetition can be employed to
have $\delta$-invariance, previous action repetition methods cannot immediately
react to unexpected situations in stochastic environments. We thus propose a
novel $\delta$-invariant method named Safe Action Repetition (SAR) applicable
to any existing PG algorithm. SAR can handle the stochasticity of environments
by adaptively reacting to changes in states during action repetition. We
empirically show that our method is not only $\delta$-invariant but also robust
to stochasticity, outperforming previous $\delta$-invariant approaches on eight
MuJoCo environments with both deterministic and stochastic settings. Our code
is available at https://vision.snu.ac.kr/projects/sar.
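To make the adaptive-repetition idea above concrete, the snippet below is a minimal, hypothetical sketch in the spirit of SAR: the policy is assumed to output both an action and a state-deviation threshold, and the action is repeated at the fine time scale $\delta$ until the observed state drifts outside that threshold. The wrapper name, the classic Gym-style step interface, and the L2 deviation test are illustrative assumptions, not the authors' exact implementation (see the linked code for that).

```python
import numpy as np


class SafeActionRepetitionWrapper:
    """Illustrative sketch: repeat an action until the state drifts too far
    from where the action was chosen. Assumes a classic Gym-style `env`
    whose step() returns (state, reward, done, info)."""

    def __init__(self, env, max_repeats=1000):
        self.env = env
        self.max_repeats = max_repeats
        self.state = None

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, action, threshold):
        """Apply `action` repeatedly at the fine time scale until the L2
        distance from the decision-time state exceeds `threshold` (the
        policy is assumed to output both `action` and `threshold`)."""
        anchor = np.asarray(self.state, dtype=np.float64)
        total_reward, done = 0.0, False
        for _ in range(self.max_repeats):
            self.state, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
            # Adaptive reaction to stochasticity: stop repeating as soon as
            # the observed state leaves the "safe" neighborhood.
            if np.linalg.norm(np.asarray(self.state) - anchor) > threshold:
                break
        return self.state, total_reward, done, {}
```

Because repetition stops as soon as the state leaves the safe neighborhood, an unexpected disturbance mid-repetition immediately triggers a new action choice, which is what distinguishes this scheme from repeating an action for a fixed number of steps.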
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d\sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted OFUL$^{+}$ and Restarted SAVE$^{+}$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Discounted Thompson Sampling for Non-Stationary Bandit Problems [13.656518163592349]
Non-stationary multi-armed bandit (NS-MAB) problems have recently received significant attention.
We propose Discounted Thompson Sampling (DS-TS) with Gaussian priors to address both non-stationary settings.
Our algorithm passively adapts to changes by incorporating a discount factor into Thompson Sampling (a minimal illustrative sketch of this discounting idea is given at the end of this page).
arXiv Detail & Related papers (2023-05-18T05:29:52Z) - Improved Sample Complexity Bounds for Distributionally Robust
Reinforcement Learning [3.222802562733787]
We consider the problem of learning a control policy that is robust against the parameter mismatches between the training environment and testing environment.
We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for the uncertainty sets specified by four different divergences.
We show that our algorithm achieves $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}| H^5)$ sample complexity, which is uniformly better than the existing results.
arXiv Detail & Related papers (2023-03-05T21:47:08Z) - Stochastic Policy Gradient Methods: Improved Sample Complexity for
Fisher-non-degenerate Policies [19.779044926914704]
We develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies.
In this work, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity for this method.
We further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient.
arXiv Detail & Related papers (2023-02-03T13:50:23Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\!\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^{2}}\right)$.
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - Tractable Optimality in Episodic Latent MABs [75.17357040707347]
We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps.
Depending on the length of the episode, the learner may not be able to estimate accurately the latent context.
We design a procedure that provably learns a near-optimal policy with $O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions.
arXiv Detail & Related papers (2022-10-05T22:53:46Z) - Simultaneously Learning Stochastic and Adversarial Bandits under the
Position-Based Model [9.945948163150874]
This work studies the online learning to rank problem in both stochastic and adversarial environments under the position-based model.
We prove that the proposed algorithm simultaneously achieves $O(\log T)$ regret in the stochastic environment and $O(m\sqrt{nT})$ regret in the adversarial environment.
Experiments show that our algorithm can simultaneously learn in both stochastic and adversarial environments and is competitive with existing methods.
arXiv Detail & Related papers (2022-07-12T10:00:14Z) - Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
We provide two variants of a Robust Phased Elimination algorithm, one that knows the corruption budget $C$ and one that does not.
We show that both variants attain near-optimal regret in the non-corrupted case $C = 0$, while incurring additional additive terms that depend on $C$.
In a contextual setting, we show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
arXiv Detail & Related papers (2020-07-07T09:00:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
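As a reading aid for the Discounted Thompson Sampling entry above, the following is a minimal, self-contained sketch of the general idea of discounting in Thompson Sampling with Gaussian posteriors. The function signature, the fixed noise scale `sigma`, and the simple discounted sufficient statistics are assumptions for illustration, not the authors' exact DS-TS algorithm.

```python
import numpy as np


def discounted_gaussian_ts(reward_fn, n_arms, horizon,
                           gamma=0.99, sigma=1.0, seed=0):
    """Illustrative Thompson Sampling with a discount factor and Gaussian
    posteriors for non-stationary bandits (sketch only).

    `reward_fn(t, arm)` is a hypothetical oracle returning the reward of
    pulling `arm` at round `t`."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)   # discounted effective number of pulls
    sums = np.zeros(n_arms)     # discounted sum of observed rewards
    pulls = []
    for t in range(horizon):
        # Posterior mean and standard deviation per arm; the small constant
        # keeps unpulled arms at very high variance, forcing exploration.
        means = sums / np.maximum(counts, 1e-6)
        stds = sigma / np.sqrt(np.maximum(counts, 1e-6))
        arm = int(np.argmax(rng.normal(means, stds)))
        r = reward_fn(t, arm)
        # Discount old evidence, then add the new observation, so the
        # posterior "forgets" and can track a drifting reward mean.
        counts *= gamma
        sums *= gamma
        counts[arm] += 1.0
        sums[arm] += r
        pulls.append(arm)
    return pulls
```

The discount step is the only change relative to stationary Gaussian Thompson Sampling: by geometrically down-weighting old observations, the posterior variance of an arm grows again if it has not been pulled recently, which is how the sketch passively adapts to non-stationarity.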