Segmenting Action-Value Functions Over Time-Scales in SARSA using TD($Δ$)
- URL: http://arxiv.org/abs/2411.14783v1
- Date: Fri, 22 Nov 2024 07:52:28 GMT
- Title: Segmenting Action-Value Functions Over Time-Scales in SARSA using TD($Δ$)
- Authors: Mahammad Humayoo
- Abstract summary: This study expands the temporal difference decomposition approach, TD($\triangle$), to the SARSA algorithm.
TD($\triangle$) facilitates learning over several time-scales by breaking the action-value function into components associated with distinct discount factors.
We illustrate that our methodology mitigates bias in SARSA's updates while facilitating accelerated convergence in contexts characterized by dense rewards.
- Abstract: In numerous episodic reinforcement learning (RL) settings, SARSA-based methodologies are employed to enhance policies aimed at maximizing returns over long horizons. Conventional SARSA algorithms, however, have difficulties in balancing bias and variance due to the reliance on a single, fixed discount factor. This study expands the temporal difference decomposition approach, TD($\triangle$), to the SARSA algorithm. SARSA, a widely utilised on-policy RL method, enhances action-value functions via temporal difference updates. TD($\triangle$) facilitates learning over several time-scales by breaking the action-value function into components associated with distinct discount factors. This decomposition improves learning efficiency and stability, particularly in problems necessitating long-horizon optimization. We illustrate that our methodology mitigates bias in SARSA's updates while facilitating accelerated convergence in contexts characterized by dense rewards. Experimental findings across many benchmark tasks indicate that the proposed SARSA($\triangle$) surpasses conventional TD learning methods in both tabular and deep RL contexts.
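To make the decomposition concrete, here is a minimal tabular sketch of a SARSA-style TD($\triangle$) update. The discount ladder, step sizes, and array shapes are illustrative assumptions; the component targets follow the standard single-step TD($\triangle$) recursion, with the on-policy next action supplying the SARSA-style bootstrap.

```python
import numpy as np

# Minimal tabular sketch of SARSA with a TD(Delta)-style decomposition.
# Assumption: the action-value function under the largest discount gamma_Z
# is represented as a sum of delta components W_z, each trained with its
# own discount factor (generic TD(Delta) component updates, not the
# paper's exact implementation).

n_states, n_actions = 16, 4
gammas = [0.5, 0.9, 0.99]                                # gamma_0 < ... < gamma_Z
W = [np.zeros((n_states, n_actions)) for _ in gammas]    # delta components
alphas = [0.1, 0.1, 0.1]                                 # per-component step sizes

def q_value(z, s, a):
    """Q under discount gamma_z is the sum of the first z+1 components."""
    return sum(W[j][s, a] for j in range(z + 1))

def sarsa_delta_update(s, a, r, s_next, a_next):
    # Component 0: an ordinary SARSA update with the smallest discount.
    delta0 = r + gammas[0] * W[0][s_next, a_next] - W[0][s, a]
    W[0][s, a] += alphas[0] * delta0
    # Components z >= 1: bootstrap on the difference between adjacent discounts.
    for z in range(1, len(gammas)):
        target = ((gammas[z] - gammas[z - 1]) * q_value(z - 1, s_next, a_next)
                  + gammas[z] * W[z][s_next, a_next])
        W[z][s, a] += alphas[z] * (target - W[z][s, a])

# Acting uses the full-horizon estimate, e.g. epsilon-greedy over
# q_value(len(gammas) - 1, s, a) across actions a.
```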
Related papers
- Time-Scale Separation in Q-Learning: Extending TD($\triangle$) for Action-Value Function Decomposition [0.0]
This paper introduces Q($\Delta$)-Learning, an extension of TD($\Delta$) to the Q-Learning framework.
TD($\Delta$) facilitates efficient learning over several time scales by breaking the Q-function into components associated with distinct discount factors.
We demonstrate through theoretical analysis and practical evaluations on standard benchmarks such as Atari that Q($\Delta$)-Learning surpasses conventional Q-Learning and TD learning methods.
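For reference, the decomposition shared by SARSA($\triangle$) and Q($\Delta$)-Learning can be stated as follows; this is the generic TD($\Delta$) component definition rather than an excerpt from either paper.

```latex
% Components over an increasing ladder of discount factors
% \gamma_0 < \gamma_1 < \dots < \gamma_Z (generic TD(Delta) definition).
W_0(s,a) = Q_{\gamma_0}(s,a), \qquad
W_z(s,a) = Q_{\gamma_z}(s,a) - Q_{\gamma_{z-1}}(s,a) \quad (z \ge 1),
\qquad
Q_{\gamma_Z}(s,a) = \sum_{z=0}^{Z} W_z(s,a).
```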
arXiv Detail & Related papers (2024-11-21T11:03:07Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
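For context, a plain importance-sampling-weighted policy gradient estimate looks like the sketch below; the paper's contribution, actively choosing the behavioural policy that minimises this estimator's variance, is not reproduced here, and all function names are placeholders.

```python
import numpy as np

# Sketch of an importance-sampling-weighted REINFORCE-style gradient estimate.
def is_policy_gradient(trajectories, log_pi, log_mu, grad_log_pi):
    """trajectories: list of (states, actions, total_return) tuples.
    log_pi / log_mu: (s, a) -> log-prob under the target / behavioural policy.
    grad_log_pi:     (s, a) -> gradient of log pi_theta(a | s) w.r.t. theta (numpy array)."""
    estimates = []
    for states, actions, ret in trajectories:
        # Trajectory-level importance weight: product of per-step ratios.
        log_w = sum(log_pi(s, a) - log_mu(s, a) for s, a in zip(states, actions))
        score = sum(grad_log_pi(s, a) for s, a in zip(states, actions))
        estimates.append(np.exp(log_w) * ret * score)
    return np.mean(estimates, axis=0)
```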
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Improve Robustness of Reinforcement Learning against Observation Perturbations via $l_\infty$ Lipschitz Policy Networks [8.39061976254379]
Deep Reinforcement Learning (DRL) has achieved remarkable advances in sequential decision tasks.
Recent works have revealed that DRL agents are susceptible to slight perturbations in observations.
We propose a novel robust reinforcement learning method called SortRL, which improves the robustness of DRL policies against observation perturbations.
arXiv Detail & Related papers (2023-12-14T08:57:22Z) - Blending Imitation and Reinforcement Learning for Robust Policy Improvement [16.588397203235296]
Imitation learning (IL) utilizes oracles to improve sample efficiency.
RPI (robust policy improvement) draws on the strengths of IL, using oracle queries to facilitate exploration.
RPI is capable of learning from and improving upon a diverse set of black-box oracles.
arXiv Detail & Related papers (2023-10-03T01:55:54Z) - Temperature Schedules for Self-Supervised Contrastive Methods on Long-Tail Data [87.77128754860983]
In this paper, we analyse the behaviour of one of the most popular variants of self-supervised learning (SSL) on long-tail data.
We find that a large contrastive temperature $\tau$ emphasises group-wise discrimination, whereas a small $\tau$ leads to a higher degree of instance discrimination.
We propose to employ a dynamic $\tau$ and show that a simple cosine schedule can yield significant improvements in the learnt representations.
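A cosine temperature schedule of the kind the summary refers to can be sketched as follows; the period and the bounds on $\tau$ are hypothetical, not values from the paper.

```python
import math

# Illustrative cosine schedule for the contrastive temperature tau
# (period and bounds are hypothetical, not taken from the paper).
def cosine_tau(step, period=1000, tau_min=0.1, tau_max=1.0):
    """Oscillate tau between tau_min and tau_max with a cosine of the given period."""
    phase = (1 + math.cos(2 * math.pi * step / period)) / 2  # in [0, 1]
    return tau_min + (tau_max - tau_min) * phase

# Large tau -> group-wise discrimination; small tau -> instance discrimination,
# so cycling between them trades off the two during training.
```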
arXiv Detail & Related papers (2023-03-23T20:37:25Z) - Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
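A return-based resampling step in this spirit can be sketched as below; the softmax-style, proportional-to-return weighting is an assumption for illustration, not ReD's exact rule.

```python
import numpy as np

# Sketch of return-based data rebalancing: resample the offline dataset with
# probabilities that favour high-return episodes. Resampling changes episode
# frequencies but leaves the dataset's support unchanged.
def rebalance_indices(episode_returns, n_samples, temperature=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    returns = np.asarray(episode_returns, dtype=float)
    # Shift for numerical stability, then apply a softmax-style weighting.
    weights = np.exp((returns - returns.max()) / temperature)
    probs = weights / weights.sum()
    return rng.choice(len(returns), size=n_samples, replace=True, p=probs)
```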
arXiv Detail & Related papers (2022-10-17T16:34:01Z) - Emphatic Algorithms for Deep Reinforcement Learning [43.17171330951343]
Temporal difference learning algorithms can become unstable when combined with function approximation and off-policy sampling.
The emphatic temporal difference algorithm, ETD($\lambda$), ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates.
We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance.
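For reference, one common formulation of the ETD($\lambda$) recursions in the linear prediction setting is sketched below; setting the interest to 1 for every state and the specific hyperparameters are assumptions.

```python
import numpy as np

# Sketch of emphatic TD(lambda) with linear value features (prediction setting).
class ETDLambda:
    def __init__(self, n_features, alpha=0.01, gamma=0.99, lam=0.9):
        self.w = np.zeros(n_features)   # linear value weights
        self.e = np.zeros(n_features)   # eligibility trace
        self.F = 0.0                    # follow-on trace
        self.rho_prev = 1.0             # IS ratio from the previous step
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def update(self, x, r, x_next, rho, interest=1.0):
        """x, x_next: numpy feature vectors; rho: IS ratio pi(a|s)/mu(a|s) at this step."""
        self.F = self.rho_prev * self.gamma * self.F + interest   # follow-on trace
        M = self.lam * interest + (1 - self.lam) * self.F         # emphasis weighting
        self.e = rho * (self.gamma * self.lam * self.e + M * x)
        delta = r + self.gamma * self.w @ x_next - self.w @ x     # TD error
        self.w += self.alpha * delta * self.e
        self.rho_prev = rho
```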
arXiv Detail & Related papers (2021-06-21T12:11:39Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm, ZO-RL, that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimate by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
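The baseline that ZO-RL improves upon, a zeroth-order gradient estimate built from randomly sampled perturbations, can be sketched as follows; the learned sampling policy itself is not reproduced.

```python
import numpy as np

# Standard zeroth-order (ZO) gradient estimate with random Gaussian perturbations;
# ZO-RL would replace the random directions below with a learned sampling policy.
def zo_gradient(f, x, n_dirs=10, mu=1e-3, rng=None):
    """f: objective returning a scalar; x: numpy array of parameters."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    fx = f(x)
    for _ in range(n_dirs):
        u = rng.standard_normal(x.shape)          # random perturbation direction
        grad += (f(x + mu * u) - fx) / mu * u     # finite-difference estimate
    return grad / n_dirs
```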
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Deep Reinforcement Learning using Cyclical Learning Rates [62.19441737665902]
One of the most influential parameters in optimization procedures based on stochastic gradient descent (SGD) is the learning rate.
We investigate cyclical learning and propose a method for defining a general cyclical learning rate for various DRL problems.
Our experiments show that utilizing cyclical learning rates achieves similar or even better results than highly tuned fixed learning rates.
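A triangular cyclical learning-rate schedule of the kind investigated here can be sketched as below; the base rate, maximum rate, and step size are hypothetical values.

```python
# Triangular cyclical learning-rate schedule (Smith-style);
# base/max rates and step size here are placeholders.
def cyclical_lr(step, base_lr=1e-4, max_lr=1e-3, step_size=2000):
    """Linearly increase then decrease the learning rate within each cycle."""
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)     # position within the cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```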
arXiv Detail & Related papers (2020-07-31T10:06:02Z) - The Effect of Multi-step Methods on Overestimation in Deep Reinforcement
Learning [6.181642248900806]
Multi-step (also called n-step) methods in reinforcement learning have been shown to be more efficient than the 1-step method.
We show that both MDDPG (multi-step DDPG) and MMDDPG (mixed multi-step DDPG) are significantly less affected by the overestimation problem than DDPG with 1-step backup.
We also discuss the advantages and disadvantages of different ways to do multi-step expansion in order to reduce approximation error.
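An n-step TD target of the kind used in these multi-step backups can be sketched as follows; q_target and policy_target are placeholder names for the target critic and target actor, not the paper's code.

```python
# n-step TD target for an actor-critic backup:
# y = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n Q'(s_{t+n}, mu'(s_{t+n})).
def n_step_target(rewards, final_state, done, q_target, policy_target, gamma=0.99):
    """rewards: the n rewards r_t, ..., r_{t+n-1} along the sampled segment."""
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    if not done:
        a_next = policy_target(final_state)
        target += (gamma ** len(rewards)) * q_target(final_state, a_next)
    return target
```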
arXiv Detail & Related papers (2020-06-23T01:35:54Z) - Distributional Robustness and Regularization in Reinforcement Learning [62.23012916708608]
We introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function.
It suggests using regularization as a practical tool for dealing with $\textit{external uncertainty}$ in reinforcement learning.
arXiv Detail & Related papers (2020-03-05T19:56:23Z)