Segmenting Action-Value Functions Over Time-Scales in SARSA using TD($Δ$)
- URL: http://arxiv.org/abs/2411.14783v1
- Date: Fri, 22 Nov 2024 07:52:28 GMT
- Title: Segmenting Action-Value Functions Over Time-Scales in SARSA using TD($Δ$)
- Authors: Mahammad Humayoo
- Abstract summary: This study expands the temporal difference decomposition approach, TD($\triangle$), to the SARSA algorithm.
TD($\triangle$) facilitates learning over several time-scales by breaking the action-value function into components associated with distinct discount factors.
We illustrate that our methodology mitigates bias in SARSA's updates while facilitating accelerated convergence in contexts characterized by dense rewards.
- Abstract: In numerous episodic reinforcement learning (RL) settings, SARSA-based methodologies are employed to enhance policies aimed at maximizing returns over long horizons. Conventional SARSA algorithms, however, have difficulties in balancing bias and variance due to the reliance on a single, fixed discount factor. This study expands the temporal difference decomposition approach, TD($\triangle$), to the SARSA algorithm. SARSA, a widely utilised on-policy RL method, enhances action-value functions via temporal difference updates. TD($\triangle$) facilitates learning over several time-scales by breaking the action-value function into components associated with distinct discount factors. This decomposition improves learning efficiency and stability, particularly in problems necessitating long-horizon optimization. We illustrate that our methodology mitigates bias in SARSA's updates while facilitating accelerated convergence in contexts characterized by dense rewards. Experimental findings across many benchmark tasks indicate that the proposed SARSA($\triangle$) surpasses conventional TD learning methods in both tabular and deep RL contexts.
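To make the decomposition concrete, here is a minimal tabular sketch of a SARSA-style TD($\triangle$) update. The discount ladder, step sizes, and array shapes are illustrative assumptions; the component targets follow the standard single-step TD($\triangle$) recursion, with the on-policy next action supplying the SARSA-style bootstrap.

```python
import numpy as np

# Minimal tabular sketch of SARSA with a TD(Delta)-style decomposition.
# Assumption: the action-value function under the largest discount gamma_Z
# is represented as a sum of delta components W_z, each trained with its
# own discount factor (generic TD(Delta) component updates, not the
# paper's exact implementation).

n_states, n_actions = 16, 4
gammas = [0.5, 0.9, 0.99]                                # gamma_0 < ... < gamma_Z
W = [np.zeros((n_states, n_actions)) for _ in gammas]    # delta components
alphas = [0.1, 0.1, 0.1]                                 # per-component step sizes

def q_value(z, s, a):
    """Q under discount gamma_z is the sum of the first z+1 components."""
    return sum(W[j][s, a] for j in range(z + 1))

def sarsa_delta_update(s, a, r, s_next, a_next):
    # Component 0: an ordinary SARSA update with the smallest discount.
    delta0 = r + gammas[0] * W[0][s_next, a_next] - W[0][s, a]
    W[0][s, a] += alphas[0] * delta0
    # Components z >= 1: bootstrap on the difference between adjacent discounts.
    for z in range(1, len(gammas)):
        target = ((gammas[z] - gammas[z - 1]) * q_value(z - 1, s_next, a_next)
                  + gammas[z] * W[z][s_next, a_next])
        W[z][s, a] += alphas[z] * (target - W[z][s, a])

# Acting uses the full-horizon estimate, e.g. epsilon-greedy over
# q_value(len(gammas) - 1, s, a) across actions a.
```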
Related papers
- Time-Scale Separation in Q-Learning: Extending TD($\triangle$) for Action-Value Function Decomposition [0.0]
This paper introduces Q($\Delta$)-Learning, an extension of TD($\Delta$) to the Q-Learning framework.
TD($\Delta$) facilitates efficient learning over several time scales by breaking the Q-function into components associated with distinct discount factors.
We demonstrate through theoretical analysis and practical evaluations on standard benchmarks such as Atari that Q($\Delta$)-Learning surpasses conventional Q-Learning and TD learning methods.
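For reference, the decomposition shared by SARSA($\triangle$) and Q($\Delta$)-Learning can be stated as follows; this is the generic TD($\Delta$) component definition rather than an excerpt from either paper.

```latex
% Components over an increasing ladder of discount factors
% \gamma_0 < \gamma_1 < \dots < \gamma_Z (generic TD(Delta) definition).
W_0(s,a) = Q_{\gamma_0}(s,a), \qquad
W_z(s,a) = Q_{\gamma_z}(s,a) - Q_{\gamma_{z-1}}(s,a) \quad (z \ge 1),
\qquad
Q_{\gamma_Z}(s,a) = \sum_{z=0}^{Z} W_z(s,a).
```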
arXiv Detail & Related papers (2024-11-21T11:03:07Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
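For context, a plain importance-sampling-weighted policy gradient estimate looks like the sketch below; the paper's contribution, actively choosing the behavioural policy that minimises this estimator's variance, is not reproduced here, and all function names are placeholders.

```python
import numpy as np

# Sketch of an importance-sampling-weighted REINFORCE-style gradient estimate.
def is_policy_gradient(trajectories, log_pi, log_mu, grad_log_pi):
    """trajectories: list of (states, actions, total_return) tuples.
    log_pi / log_mu: (s, a) -> log-prob under the target / behavioural policy.
    grad_log_pi:     (s, a) -> gradient of log pi_theta(a | s) w.r.t. theta (numpy array)."""
    estimates = []
    for states, actions, ret in trajectories:
        # Trajectory-level importance weight: product of per-step ratios.
        log_w = sum(log_pi(s, a) - log_mu(s, a) for s, a in zip(states, actions))
        score = sum(grad_log_pi(s, a) for s, a in zip(states, actions))
        estimates.append(np.exp(log_w) * ret * score)
    return np.mean(estimates, axis=0)
```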
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Improve Robustness of Reinforcement Learning against Observation Perturbations via $l_\infty$ Lipschitz Policy Networks [8.39061976254379]
Deep Reinforcement Learning (DRL) has achieved remarkable advances in sequential decision tasks.
Recent works have revealed that DRL agents are susceptible to slight perturbations in observations.
We propose a novel robust reinforcement learning method called SortRL, which improves the robustness of DRL policies against observation perturbations.
arXiv Detail & Related papers (2023-12-14T08:57:22Z) - Blending Imitation and Reinforcement Learning for Robust Policy Improvement [16.588397203235296]
Imitation learning (IL) utilizes oracles to improve sample efficiency.
RPI (robust policy improvement) draws on the strengths of IL, using oracle queries to facilitate exploration.
RPI is capable of learning from and improving upon a diverse set of black-box oracles.
arXiv Detail & Related papers (2023-10-03T01:55:54Z) - Temperature Schedules for Self-Supervised Contrastive Methods on Long-Tail Data [87.77128754860983]
In this paper, we analyse the behaviour of one of the most popular variants of self-supervised learning (SSL) on long-tail data.
We find that a large contrastive temperature $\tau$ emphasises group-wise discrimination, whereas a small $\tau$ leads to a higher degree of instance discrimination.
We propose to employ a dynamic $\tau$ and show that a simple cosine schedule can yield significant improvements in the learnt representations.
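A cosine temperature schedule of the kind the summary refers to can be sketched as follows; the period and the bounds on $\tau$ are hypothetical, not values from the paper.

```python
import math

# Illustrative cosine schedule for the contrastive temperature tau
# (period and bounds are hypothetical, not taken from the paper).
def cosine_tau(step, period=1000, tau_min=0.1, tau_max=1.0):
    """Oscillate tau between tau_min and tau_max with a cosine of the given period."""
    phase = (1 + math.cos(2 * math.pi * step / period)) / 2  # in [0, 1]
    return tau_min + (tau_max - tau_min) * phase

# Large tau -> group-wise discrimination; small tau -> instance discrimination,
# so cycling between them trades off the two during training.
```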
arXiv Detail & Related papers (2023-03-23T20:37:25Z) - Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
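A return-based resampling step in this spirit can be sketched as below; the softmax-style, proportional-to-return weighting is an assumption for illustration, not ReD's exact rule.

```python
import numpy as np

# Sketch of return-based data rebalancing: resample the offline dataset with
# probabilities that favour high-return episodes. Resampling changes episode
# frequencies but leaves the dataset's support unchanged.
def rebalance_indices(episode_returns, n_samples, temperature=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    returns = np.asarray(episode_returns, dtype=float)
    # Shift for numerical stability, then apply a softmax-style weighting.
    weights = np.exp((returns - returns.max()) / temperature)
    probs = weights / weights.sum()
    return rng.choice(len(returns), size=n_samples, replace=True, p=probs)
```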
arXiv Detail & Related papers (2022-10-17T16:34:01Z) - Emphatic Algorithms for Deep Reinforcement Learning [43.17171330951343]
Temporal difference learning algorithms can become unstable when combined with function approximation and off-policy sampling.
The emphatic temporal difference algorithm, ETD($\lambda$), ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates.
We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance.
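For reference, one common formulation of the ETD($\lambda$) recursions in the linear prediction setting is sketched below; setting the interest to 1 for every state and the specific hyperparameters are assumptions.

```python
import numpy as np

# Sketch of emphatic TD(lambda) with linear value features (prediction setting).
class ETDLambda:
    def __init__(self, n_features, alpha=0.01, gamma=0.99, lam=0.9):
        self.w = np.zeros(n_features)   # linear value weights
        self.e = np.zeros(n_features)   # eligibility trace
        self.F = 0.0                    # follow-on trace
        self.rho_prev = 1.0             # IS ratio from the previous step
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def update(self, x, r, x_next, rho, interest=1.0):
        """x, x_next: numpy feature vectors; rho: IS ratio pi(a|s)/mu(a|s) at this step."""
        self.F = self.rho_prev * self.gamma * self.F + interest   # follow-on trace
        M = self.lam * interest + (1 - self.lam) * self.F         # emphasis weighting
        self.e = rho * (self.gamma * self.lam * self.e + M * x)
        delta = r + self.gamma * self.w @ x_next - self.w @ x     # TD error
        self.w += self.alpha * delta * self.e
        self.rho_prev = rho
```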
arXiv Detail & Related papers (2021-06-21T12:11:39Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm, ZO-RL, that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimate by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
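The baseline that ZO-RL improves upon, a zeroth-order gradient estimate built from randomly sampled perturbations, can be sketched as follows; the learned sampling policy itself is not reproduced.

```python
import numpy as np

# Standard zeroth-order (ZO) gradient estimate with random Gaussian perturbations;
# ZO-RL would replace the random directions below with a learned sampling policy.
def zo_gradient(f, x, n_dirs=10, mu=1e-3, rng=None):
    """f: objective returning a scalar; x: numpy array of parameters."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    fx = f(x)
    for _ in range(n_dirs):
        u = rng.standard_normal(x.shape)          # random perturbation direction
        grad += (f(x + mu * u) - fx) / mu * u     # finite-difference estimate
    return grad / n_dirs
```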
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Deep Reinforcement Learning using Cyclical Learning Rates [62.19441737665902]
One of the most influential parameters in optimization procedures based on stochastic gradient descent (SGD) is the learning rate.
We investigate cyclical learning and propose a method for defining a general cyclical learning rate for various DRL problems.
Our experiments show that utilizing cyclical learning rates achieves similar or even better results than highly tuned fixed learning rates.
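A triangular cyclical learning-rate schedule of the kind investigated here can be sketched as below; the base rate, maximum rate, and step size are hypothetical values.

```python
# Triangular cyclical learning-rate schedule (Smith-style);
# base/max rates and step size here are placeholders.
def cyclical_lr(step, base_lr=1e-4, max_lr=1e-3, step_size=2000):
    """Linearly increase then decrease the learning rate within each cycle."""
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)     # position within the cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```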
arXiv Detail & Related papers (2020-07-31T10:06:02Z) - The Effect of Multi-step Methods on Overestimation in Deep Reinforcement
Learning [6.181642248900806]
Multi-step (also called n-step) methods in reinforcement learning have been shown to be more efficient than the 1-step method.
We show that both MDDPG (multi-step DDPG) and MMDDPG (mixed multi-step DDPG) are significantly less affected by the overestimation problem than DDPG with 1-step backup.
We also discuss the advantages and disadvantages of different ways to do multi-step expansion in order to reduce approximation error.
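An n-step TD target of the kind used in these multi-step backups can be sketched as follows; q_target and policy_target are placeholder names for the target critic and target actor, not the paper's code.

```python
# n-step TD target for an actor-critic backup:
# y = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n Q'(s_{t+n}, mu'(s_{t+n})).
def n_step_target(rewards, final_state, done, q_target, policy_target, gamma=0.99):
    """rewards: the n rewards r_t, ..., r_{t+n-1} along the sampled segment."""
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    if not done:
        a_next = policy_target(final_state)
        target += (gamma ** len(rewards)) * q_target(final_state, a_next)
    return target
```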
arXiv Detail & Related papers (2020-06-23T01:35:54Z) - Distributional Robustness and Regularization in Reinforcement Learning [62.23012916708608]
We introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function.
It suggests using regularization as a practical tool for dealing with $\textit{external uncertainty}$ in reinforcement learning.
arXiv Detail & Related papers (2020-03-05T19:56:23Z)