An Empirical Comparison of Off-policy Prediction Learning Algorithms on
the Collision Task
- URL: http://arxiv.org/abs/2106.00922v1
- Date: Wed, 2 Jun 2021 03:45:43 GMT
- Title: An Empirical Comparison of Off-policy Prediction Learning Algorithms on
the Collision Task
- Authors: Sina Ghiassian, Richard S. Sutton
- Abstract summary: Off-policy prediction -- learning the value function for one policy from data generated while following another policy -- is one of the most challenging subproblems in reinforcement learning.
This paper presents empirical results with eleven prominent off-policy learning algorithms that use linear function approximation.
- Score: 9.207173776826403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy prediction -- learning the value function for one policy from data
generated while following another policy -- is one of the most challenging
subproblems in reinforcement learning. This paper presents empirical results
with eleven prominent off-policy learning algorithms that use linear function
approximation: five Gradient-TD methods, two Emphatic-TD methods, Off-policy
TD($\lambda$), Vtrace, and versions of Tree Backup and ABQ modified to apply to
a prediction setting. Our experiments used the Collision task, a small
idealized off-policy problem analogous to that of an autonomous car trying to
predict whether it will collide with an obstacle. We assessed the performance
of the algorithms according to their learning rate, asymptotic error level, and
sensitivity to step-size and bootstrapping parameters. By these measures, the
eleven algorithms can be partially ordered on the Collision task. In the top
tier, the two Emphatic-TD algorithms learned the fastest, reached the lowest
errors, and were robust to parameter settings. In the middle tier, the five
Gradient-TD algorithms and Off-policy TD($\lambda$) were more sensitive to the
bootstrapping parameter. The bottom tier comprised Vtrace, Tree Backup, and
ABQ; these algorithms were no faster and had higher asymptotic error than the
others. Our results are definitive for this task, though of course experiments
with more tasks are needed before an overall assessment of the algorithms'
merits can be made.
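As a rough illustration of the simplest of these eleven learners, the following is a minimal sketch of Off-policy TD($\lambda$) with linear function approximation and per-decision importance sampling ratios. The function signature, array shapes, and default parameter values are illustrative assumptions rather than code from the paper, and terminal-state discounting is simplified.

```python
import numpy as np

def off_policy_td_lambda(features, rewards, rhos, alpha=0.01, gamma=0.9, lam=0.9):
    """One linear-FA Off-policy TD(lambda) pass over a single trajectory (sketch).

    features : (T+1, d) array of feature vectors x_0 ... x_T
    rewards  : (T,) array of rewards R_1 ... R_T
    rhos     : (T,) importance sampling ratios rho_t = pi(A_t|S_t) / b(A_t|S_t)
    Returns the weight vector w; the value estimate of a state is w @ x.
    Terminal-state handling (gamma = 0 at episode end) is omitted for brevity.
    """
    d = features.shape[1]
    w = np.zeros(d)
    z = np.zeros(d)                                            # eligibility trace
    for t in range(len(rewards)):
        x, x_next = features[t], features[t + 1]
        delta = rewards[t] + gamma * (w @ x_next) - (w @ x)    # TD error
        z = rhos[t] * (gamma * lam * z + x)                    # per-decision IS on the trace
        w = w + alpha * delta * z
    return w
```

The per-decision ratio on the eligibility trace is what makes the update off-policy; roughly speaking, the other algorithms studied in the paper differ in how they correct, clip, or reweight such updates.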
Related papers
- Unsupervised Learning of Initialization in Deep Neural Networks via
Maximum Mean Discrepancy [74.34895342081407]
We propose an unsupervised algorithm to find a good initialization for input data.
We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification.
We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters.
arXiv Detail & Related papers (2023-02-08T23:23:28Z) - AWD3: Dynamic Reduction of the Estimation Bias [0.0]
We introduce a technique that eliminates the estimation bias in off-policy continuous control algorithms using the experience replay mechanism.
We show, in continuous control environments from OpenAI Gym, that our algorithm matches or outperforms state-of-the-art off-policy policy gradient learning algorithms.
arXiv Detail & Related papers (2021-11-12T15:46:19Z) - Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping space and time complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z) - An Empirical Comparison of Off-policy Prediction Learning Algorithms in
the Four Rooms Environment [9.207173776826403]
We empirically compare 11 off-policy prediction learning algorithms with linear function approximation on two small tasks.
The algorithms' performance is highly affected by the variance induced by the importance sampling ratios.
Emphatic TD($\lambda$) tends to have lower error than other algorithms, but might learn more slowly in some cases.
arXiv Detail & Related papers (2021-09-10T21:15:41Z) - Emphatic Algorithms for Deep Reinforcement Learning [43.17171330951343]
Temporal difference learning algorithms can become unstable when combined with function approximation and off-policy sampling.
The Emphatic temporal-difference (ETD($\lambda$)) algorithm ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates (a minimal sketch of this weighting appears after this list).
We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance.
arXiv Detail & Related papers (2021-06-21T12:11:39Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased navigation gradients of the value function, which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Learning and Planning in Average-Reward Markov Decision Processes [15.586087060535398]
We introduce learning and planning algorithms for average-reward MDPs.
All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward.
arXiv Detail & Related papers (2020-06-29T19:03:24Z) - Exploiting Higher Order Smoothness in Derivative-free Optimization and
Continuous Bandits [99.70167985955352]
We study the problem of zero-order optimization of a strongly convex function.
We consider a randomized approximation of the projected gradient descent algorithm.
Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters.
arXiv Detail & Related papers (2020-06-14T10:42:23Z)
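For concreteness, the emphasis weighting referred to in the Emphatic Algorithms entry above can be sketched as follows. This is a simplified, illustrative rendering of ETD($\lambda$) with a constant interest and a fixed discount rate; the names and defaults are assumptions, not the reference implementation used in any of the papers listed here.

```python
import numpy as np

def emphatic_td_lambda(features, rewards, rhos, alpha=0.01, gamma=0.9,
                       lam=0.9, interest=1.0):
    """One linear-FA Emphatic TD(lambda) pass over a single trajectory (sketch).

    Same inputs as the Off-policy TD(lambda) sketch above; `interest` weights
    how much accurate prediction in each state matters (a constant here for
    simplicity).
    """
    d = features.shape[1]
    w = np.zeros(d)
    z = np.zeros(d)                                            # eligibility trace
    F = 0.0                                                    # followon trace
    rho_prev = 1.0
    for t in range(len(rewards)):
        x, x_next = features[t], features[t + 1]
        delta = rewards[t] + gamma * (w @ x_next) - (w @ x)    # TD error
        F = rho_prev * gamma * F + interest                    # followon trace
        M = lam * interest + (1.0 - lam) * F                   # emphasis
        z = rhos[t] * (gamma * lam * z + M * x)                # emphasis-weighted trace
        w = w + alpha * delta * z
        rho_prev = rhos[t]
    return w
```

The emphasis M rescales each update by how much the target policy would have visited the current state, which is the weighting that restores convergence guarantees for linear off-policy TD.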