Time-Scale Separation in Q-Learning: Extending TD($\Delta$) for Action-Value Function Decomposition
- URL: http://arxiv.org/abs/2411.14019v1
- Date: Thu, 21 Nov 2024 11:03:07 GMT
- Title: Time-Scale Separation in Q-Learning: Extending TD($\Delta$) for Action-Value Function Decomposition
- Authors: Mahammad Humayoo
- Abstract summary: This paper introduces Q($\Delta$)-Learning, an extension of TD($\Delta$) for the Q-Learning framework.
TD($\Delta$) facilitates efficient learning over several time scales by decomposing the Q($\Delta$)-function into components associated with distinct discount factors.
We demonstrate through theoretical analysis and practical evaluations on standard benchmarks like Atari that Q($\Delta$)-Learning surpasses conventional Q-Learning and TD learning methods.
- Abstract: Q-Learning is a fundamental off-policy reinforcement learning (RL) algorithm whose objective is to approximate action-value functions in order to learn optimal policies. Nonetheless, it has difficulty balancing bias and variance, particularly in the context of long-term rewards. This paper introduces Q($\Delta$)-Learning, an extension of TD($\Delta$) to the Q-Learning framework. TD($\Delta$) facilitates efficient learning over several time scales by decomposing the Q($\Delta$)-function into components associated with distinct discount factors. This approach offers improved learning stability and scalability, especially for long-term tasks where discounting bias may impede convergence. Our method ensures that each component of the Q($\Delta$)-function is learned individually, enabling faster convergence on shorter time scales and improving the learning of longer time scales. We demonstrate through theoretical analysis and practical evaluations on standard benchmarks such as Atari that Q($\Delta$)-Learning surpasses conventional Q-Learning and TD learning methods in both tabular and deep RL environments.
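The abstract does not spell out the update rule, but the decomposition it describes can be sketched concretely. The following minimal tabular sketch assumes the natural Q-Learning analogue of the TD($\Delta$) construction: the action-value function for the largest discount factor is written as a sum of delta components $W_z = Q_{\gamma_z} - Q_{\gamma_{z-1}}$, each updated with its own TD error. All names, discount factors, and the greedy-action choice below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

# Minimal tabular sketch of the Q(Delta) decomposition described above.
# Assumption (not given in the abstract): W_0 estimates Q_{gamma_0} and each
# W_z (z >= 1) estimates the difference Q_{gamma_z} - Q_{gamma_{z-1}}, so the
# full action-value function is the sum of all components. Hyperparameters
# and sizes below are placeholders.

gammas = [0.0, 0.5, 0.9, 0.99]                    # increasing discount factors gamma_0 < ... < gamma_Z
n_states, n_actions, alpha = 10, 4, 0.1
W = np.zeros((len(gammas), n_states, n_actions))  # one delta component per time scale


def q_delta_update(s, a, r, s_next):
    """Apply one Q(Delta)-style update for the transition (s, a, r, s_next)."""
    q_agg = W.sum(axis=0)                    # aggregated Q(s, a) = sum_z W_z(s, a)
    a_next = int(np.argmax(q_agg[s_next]))   # greedy next action under the aggregated Q

    # Shortest time scale: an ordinary Q-Learning update with discount gamma_0.
    td0 = r + gammas[0] * W[0, s_next, a_next] - W[0, s, a]
    W[0, s, a] += alpha * td0

    # Longer time scales: each component bootstraps on the shorter scales'
    # estimate of Q_{gamma_{z-1}}(s', a') and only learns the increment
    # contributed by raising the discount factor from gamma_{z-1} to gamma_z.
    for z in range(1, len(gammas)):
        q_prev = W[:z, s_next, a_next].sum()
        td_z = ((gammas[z] - gammas[z - 1]) * q_prev
                + gammas[z] * W[z, s_next, a_next]
                - W[z, s, a])
        W[z, s, a] += alpha * td_z
```

Under this decomposition the aggregated estimate is simply the sum of the components, so short-time-scale components can converge quickly while longer time scales only have to learn the residual between successive discount factors, which is the intuition behind the stability claims in the abstract.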
Related papers
- Segmenting Action-Value Functions Over Time-Scales in SARSA using TD($\Delta$) [0.0]
This study expands the temporal difference decomposition approach, TD($\Delta$), to the SARSA algorithm.
TD($\Delta$) facilitates learning over several time-scales by breaking the action-value function into components associated with distinct discount factors.
We illustrate that our methodology mitigates bias in SARSA's updates while accelerating convergence in contexts characterized by dense rewards.
arXiv Detail & Related papers (2024-11-22T07:52:28Z) - Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Q-learning, StochDQN, StochDDQN, all of which integrate this approach for both value-function updates and action selection.
arXiv Detail & Related papers (2024-05-16T17:58:44Z) - Sequence Compression Speeds Up Credit Assignment in Reinforcement Learning [33.28797183140384]
Temporal difference (TD) learning uses bootstrapping to overcome variance but introduces a bias that can only be corrected through many iterations.
We propose Chunked-TD, which uses predicted probabilities of transitions from a model for computing $\lambda$-return targets.
arXiv Detail & Related papers (2024-05-06T21:49:29Z) - Prediction and Control in Continual Reinforcement Learning [39.30411018922005]
Temporal difference (TD) learning is often used to update the estimate of the value function which is used by RL agents to extract useful policies.
We propose to decompose the value function into two components which update at different timescales.
arXiv Detail & Related papers (2023-12-18T19:23:42Z) - A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large, as well as near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z) - Discerning Temporal Difference Learning [5.439020425819001]
Temporal difference learning (TD) is a foundational concept in reinforcement learning (RL).
We propose a novel TD algorithm named discerning TD learning (DTD).
arXiv Detail & Related papers (2023-10-12T07:38:10Z) - Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z) - Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates [110.92598350897192]
Q-Learning has proven effective at learning a policy to perform control tasks.
Estimation noise becomes a bias after the max operator in the policy improvement step.
We present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state Markov Decision Processes.
arXiv Detail & Related papers (2021-10-28T00:07:19Z) - IQ-Learn: Inverse soft-Q Learning for Imitation [95.06031307730245]
Imitation learning from a small amount of expert data can be challenging in high-dimensional environments with complex dynamics.
Behavioral cloning is a simple method that is widely used due to its simplicity of implementation and stable convergence.
We introduce a method for dynamics-aware IL which avoids adversarial training by learning a single Q-function.
arXiv Detail & Related papers (2021-06-23T03:43:10Z) - Hierarchical Reinforcement Learning as a Model of Human Task Interleaving [60.95424607008241]
We develop a hierarchical model of supervisory control driven by reinforcement learning.
The model reproduces known empirical effects of task interleaving.
The results support hierarchical RL as a plausible model of task interleaving.
arXiv Detail & Related papers (2020-01-04T17:53:28Z)