Segmenting Action-Value Functions Over Time-Scales in SARSA via TD($\Delta$)
- URL: http://arxiv.org/abs/2411.14783v2
- Date: Sat, 04 Jan 2025 03:57:50 GMT
- Title: Segmenting Action-Value Functions Over Time-Scales in SARSA via TD($\Delta$)
- Authors: Mahammad Humayoo
- Abstract summary: This study extends the temporal difference decomposition approach, TD($\Delta$), to the SARSA algorithm.
TD($\Delta$) facilitates learning over several time-scales by breaking the action-value function into components associated with distinct discount factors.
- Score: 0.0
- Abstract: In numerous episodic reinforcement learning (RL) settings, SARSA-based methodologies are employed to enhance policies aimed at maximizing returns over long horizons. Conventional SARSA algorithms, however, have difficulty balancing bias and variance due to their reliance on a single, fixed discount factor. This study extends the temporal difference decomposition approach, TD($\Delta$), to the SARSA algorithm, which we designate as SARSA($\Delta$). SARSA, a widely utilised on-policy RL method, enhances action-value functions via temporal difference updates. TD($\Delta$) facilitates learning over several time-scales by breaking the action-value function into components associated with distinct discount factors. This decomposition improves learning efficiency and stability, particularly in problems necessitating long-horizon optimization. We illustrate that our methodology mitigates bias in SARSA's updates while facilitating accelerated convergence in both deterministic and stochastic environments. Experimental findings across many benchmark tasks indicate that the proposed SARSA($\Delta$) surpasses conventional TD learning methods in both tabular and deep RL environments.
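The decomposition can be illustrated with a small tabular sketch. Assuming the delta components follow the TD($\Delta$) recursion adapted to action values ($W_0$ approximates $Q_{\gamma_0}$, and each $W_z$ for $z \ge 1$ approximates $Q_{\gamma_z} - Q_{\gamma_{z-1}}$), a SARSA-style update of every component from a single transition might look as follows; the class name, discount ladder, and step size are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

class SarsaDelta:
    """Tabular sketch of a SARSA(Delta)-style learner (illustrative only)."""

    def __init__(self, n_states, n_actions, gammas=(0.5, 0.9, 0.99), alpha=0.1):
        self.gammas = list(gammas)   # increasing discount factors, one per time-scale
        self.alpha = alpha
        # one delta component W_z per discount factor
        self.W = [np.zeros((n_states, n_actions)) for _ in gammas]

    def q(self, s, a, z=None):
        """Action value at the z-th time-scale: Q_{gamma_z} = sum_{u <= z} W_u."""
        z = len(self.W) - 1 if z is None else z
        return sum(self.W[u][s, a] for u in range(z + 1))

    def update(self, s, a, r, s_next, a_next):
        """On-policy (SARSA) update of every delta component from one transition."""
        for z, gamma in enumerate(self.gammas):
            if z == 0:
                # W_0 is an ordinary SARSA target with the smallest discount factor
                target = r + gamma * self.W[0][s_next, a_next]
            else:
                # W_z bootstraps from the previous time-scale's action value,
                # so the reward term cancels out of its target
                q_prev = self.q(s_next, a_next, z - 1)
                target = (gamma - self.gammas[z - 1]) * q_prev \
                         + gamma * self.W[z][s_next, a_next]
            self.W[z][s, a] += self.alpha * (target - self.W[z][s, a])
```

Control would then use the summed components at the largest discount factor, i.e. `q(s, a)` with `z` left at its default.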
Related papers
- Time-Scale Separation in Q-Learning: Extending TD($\triangle$) for Action-Value Function Decomposition [0.0]
This paper introduces Q($\Delta$)-Learning, an extension of TD($\Delta$) to the Q-Learning framework.
TD($\Delta$) facilitates efficient learning over several time-scales by breaking the Q-function into components associated with distinct discount factors.
We demonstrate through theoretical analysis and practical evaluations on standard benchmarks like Atari that Q($\Delta$)-Learning surpasses conventional Q-Learning and TD learning methods.
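For contrast with the on-policy updates sketched above, an off-policy Q($\Delta$)-style variant would bootstrap from a greedy action at the next state, chosen with respect to the summed components. A minimal helper for that choice (the interface is an assumption for illustration):

```python
import numpy as np

def greedy_action(W, s_next):
    """Pick argmax_a sum_z W_z[s_next, a]; W is a list of (n_states, n_actions) arrays."""
    q_total = sum(w[s_next] for w in W)   # total action value at s_next, shape (n_actions,)
    return int(np.argmax(q_total))
```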
arXiv Detail & Related papers (2024-11-21T11:03:07Z) - Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales [13.818149654692863]
Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance.
In this work, we improve the stability of RL training by adapting reverse cross-entropy (RCE), originally used in supervised learning on noisy data, to define a symmetric RL loss.
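As background for the summary above, the usual symmetric combination of cross-entropy with its reverse (from the noisy-label literature) looks like the sketch below; the weighting coefficients and the clip value for log(0) are illustrative assumptions, and the paper's RL-specific formulation is not reproduced here.

```python
import numpy as np

def symmetric_ce(logits, target_idx, alpha=1.0, beta=1.0, A=-4.0):
    """CE plus reverse CE for a single example (numpy sketch, assumptions noted above)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    ce = -np.log(probs[target_idx] + 1e-12)          # ordinary cross-entropy
    # reverse CE swaps prediction and target; log(1) = 0, log(0) clipped to A
    log_onehot = np.full_like(probs, A)
    log_onehot[target_idx] = 0.0
    rce = -np.sum(probs * log_onehot)
    return alpha * ce + beta * rce
```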
arXiv Detail & Related papers (2024-05-27T19:28:33Z) - Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Q-learning, StochDQN, StochDDQN, all of which integrate this approach for both value-function updates and action selection.
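A hedged sketch of the core selection step described above: maximise over a random subset of roughly $\log(n)$ actions rather than all $n$. Whether and how previously used actions are retained in the subset is an assumption here, not the paper's exact procedure.

```python
import numpy as np

def stochastic_max_action(q_row, prev_action=None, rng=None):
    """Return an (approximate) argmax over a random O(log n) subset of actions."""
    rng = rng or np.random.default_rng()
    n = len(q_row)
    k = max(1, int(np.ceil(np.log2(n))))
    candidates = set(rng.choice(n, size=k, replace=False).tolist())
    if prev_action is not None:
        candidates.add(prev_action)              # optionally keep the last action around
    return max(candidates, key=lambda a: q_row[a])
```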
arXiv Detail & Related papers (2024-05-16T17:58:44Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
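To make the variance argument concrete, a generic importance-sampling policy-gradient estimator is sketched below; the ratio of target to behaviour probabilities scales every sample's contribution, which is what the choice of behaviour policy controls. This is standard background, not the paper's active sampling scheme.

```python
import numpy as np

def is_policy_gradient(grads_logp, returns, logp_target, logp_behaviour):
    """REINFORCE-style gradient with importance weights (numpy sketch).

    grads_logp: (N, d) score vectors grad log pi_target(a|s) for each sample;
    returns: (N,) sampled returns; logp_*: (N,) log-probs of the sampled actions.
    """
    weights = np.exp(logp_target - logp_behaviour)            # importance weights
    return (weights[:, None] * returns[:, None] * grads_logp).mean(axis=0)
```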
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - On Improving the Algorithm-, Model-, and Data- Efficiency of Self-Supervised Learning [18.318758111829386]
We propose an efficient single-branch SSL method based on non-parametric instance discrimination.
We also propose a novel self-distillation loss that minimizes the KL divergence between the probability distribution and its square root version.
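A minimal sketch of the square-root self-distillation idea as summarised above: take the element-wise square root of a probability vector, renormalise it, and penalise the KL divergence between the two. The direction of the KL and any stop-gradient placement are assumptions here.

```python
import numpy as np

def sqrt_self_distillation_loss(p, eps=1e-12):
    """KL(p || sqrt-version of p), both renormalised (illustrative sketch)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    q = np.sqrt(p)
    q = q / q.sum()                                  # square-root (flattened) version
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```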
arXiv Detail & Related papers (2024-04-30T06:39:04Z) - Blending Imitation and Reinforcement Learning for Robust Policy
Improvement [16.588397203235296]
Imitation learning (IL) utilizes oracles to improve sample efficiency.
RPI (Robust Policy Improvement) draws on the strengths of IL, using oracle queries to facilitate exploration.
RPI is capable of learning from and improving upon a diverse set of black-box oracles.
arXiv Detail & Related papers (2023-10-03T01:55:54Z) - Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important for solving sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
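The distributional ingredient can be illustrated with the standard quantile-regression (pinball) loss, which is how a value distribution is typically fitted at a fixed set of quantile levels; EQR's model-based Bayesian machinery is not reproduced in this sketch.

```python
import numpy as np

def pinball_loss(pred_quantiles, target, taus):
    """Quantile-regression loss for one sampled return (numpy sketch).

    pred_quantiles, taus: (K,) arrays of predicted quantiles and their levels;
    target: a scalar sample of the return.
    """
    diff = target - pred_quantiles
    return float(np.mean(np.where(diff >= 0, taus * diff, (taus - 1.0) * diff)))
```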
arXiv Detail & Related papers (2023-08-12T14:59:19Z) - Temperature Schedules for Self-Supervised Contrastive Methods on
Long-Tail Data [87.77128754860983]
In this paper, we analyse the behaviour of one of the most popular variants of self-supervised learning (SSL) on long-tail data.
We find that a large $\tau$ emphasises group-wise discrimination, whereas a small $\tau$ leads to a higher degree of instance discrimination.
We propose to employ a dynamic $\tau$ and show that a simple cosine schedule can yield significant improvements in the learnt representations.
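A cosine schedule of the kind mentioned above can be as simple as the sketch below; the period and the $\tau$ range are illustrative values rather than the paper's settings.

```python
import math

def cosine_tau(step, period=1000, tau_min=0.1, tau_max=1.0):
    """Oscillate the contrastive temperature between tau_min and tau_max."""
    return tau_min + 0.5 * (tau_max - tau_min) * (1.0 + math.cos(2.0 * math.pi * step / period))
```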
arXiv Detail & Related papers (2023-03-23T20:37:25Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
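A minimal sketch in the spirit of that description: remember an exponential average of past update directions and scale each parameter's step up when the current gradient agrees with it, down when it disagrees. The modulation rule and constants are assumptions, not the published AdaRem update.

```python
import numpy as np

def adarem_like_step(params, grad, avg_dir, lr=1e-3, beta=0.9, strength=0.5):
    """One alignment-modulated SGD step (illustrative sketch, see caveats above)."""
    avg_dir = beta * avg_dir + (1.0 - beta) * np.sign(grad)    # remembered direction
    align = np.sign(grad) * avg_dir                            # per-parameter, in [-1, 1]
    per_param_lr = lr * (1.0 + strength * align)               # larger when aligned
    return params - per_param_lr * grad, avg_dir
```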
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - The Effect of Multi-step Methods on Overestimation in Deep Reinforcement
Learning [6.181642248900806]
Multi-step (also called n-step) methods in reinforcement learning have been shown to be more efficient than the 1-step method.
We show that both MDDPG and MMDDPG are significantly less affected by the overestimation problem than DDPG with 1-step backup.
We also discuss the advantages and disadvantages of different ways to do multi-step expansion in order to reduce approximation error.
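The kind of multi-step backup being compared can be written as a single n-step target: sum the first n discounted rewards and bootstrap once from the critic at step n. The function below is a generic sketch, with the critic's value passed in as an assumption about the interface.

```python
import numpy as np

def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """n-step TD target: rewards is the list of the first n rewards,
    bootstrap_value is the critic's estimate at the state reached after n steps."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.sum(discounts * np.asarray(rewards)) + (gamma ** n) * bootstrap_value)
```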
arXiv Detail & Related papers (2020-06-23T01:35:54Z) - Distributional Robustness and Regularization in Reinforcement Learning [62.23012916708608]
We introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function.
It suggests using regularization as a practical tool for dealing with external uncertainty in reinforcement learning.
arXiv Detail & Related papers (2020-03-05T19:56:23Z)