Related papers: A Variance Minimization Approach to Temporal-Difference Learning

URL: http://arxiv.org/abs/2411.06396v1
Date: Sun, 10 Nov 2024 08:56:16 GMT
Title: A Variance Minimization Approach to Temporal-Difference Learning
Authors: Xingguo Chen, Yu Gong, Shangdong Yang, Wenhao Wang
Abstract summary: This paper introduces a variance minimization (VM) approach for value-based RL instead of error minimization. Based on this approach, we proposed two objectives, the Variance of Bellman Error (VBE) and the Variance of Projected Bellman Error (VPBE).
Score: 12.026021568207206
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Fast-converging algorithms are a contemporary requirement in reinforcement learning. In the context of linear function approximation, the magnitude of the smallest eigenvalue of the key matrix is a major factor reflecting the convergence speed. Traditional value-based RL algorithms focus on minimizing errors. This paper introduces a variance minimization (VM) approach for value-based RL instead of error minimization. Based on this approach, we proposed two objectives, the Variance of Bellman Error (VBE) and the Variance of Projected Bellman Error (VPBE), and derived the VMTD, VMTDC, and VMETD algorithms. We provided proofs of their convergence and optimal policy invariance of the variance minimization. Experimental studies validate the effectiveness of the proposed algorithms.

Related papers

Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning [55.33984461046492]
Policy-based methods currently dominate reinforcement learning pipelines for large language model (LLM) reasoning. We introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis.
arXiv Detail & Related papers (2025-05-21T09:41:53Z)
Deep Minimax Classifiers for Imbalanced Datasets with a Small Number of Minority Samples [5.217870815854702]
We propose a novel minimax learning algorithm designed to minimize the risk of worst-performing classes. Our proposed algorithm has a provable convergence property, and empirical results indicate that our algorithm performs better than or is comparable to existing methods.
arXiv Detail & Related papers (2025-02-24T08:20:02Z)
Solving Hidden Monotone Variational Inequalities with Surrogate Losses [23.565183680315073]
We propose a principled surrogate-based approach compatible with deep learning to solve variational inequality (VI) problems. We demonstrate our surrogate-based approach is effective in min-max optimization and minimizing projected Bellman error. In the deep reinforcement learning case, we propose a novel variant of TD(0) which is more compute and sample efficient.
arXiv Detail & Related papers (2024-11-07T22:42:08Z)
A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning. These problems are often formalized as Bi-Level optimizations (BLO). We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z)
Parameterized Projected Bellman Operator [64.129598593852]
Approximate value iteration (AVI) is a family of algorithms for reinforcement learning (RL). We propose a novel alternative approach based on learning an approximate version of the Bellman operator. We formulate an optimization problem to learn PBO for generic sequential decision-making problems.
arXiv Detail & Related papers (2023-12-20T09:33:16Z)
Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice [79.48432795639403]
Mirror descent value iteration (MDVI) is an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL). We study MDVI with linear function approximation through its sample complexity required to identify an $\varepsilon$-optimal policy. We present Variance-Weighted Least-Squares MDVI, the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs.
arXiv Detail & Related papers (2023-05-22T16:13:05Z)
Accelerated and instance-optimal policy evaluation with linear function approximation [17.995515643150657]
Existing algorithms fail to match at least one of these lower bounds. We develop an accelerated, variance-reduced fast temporal difference algorithm that simultaneously matches both lower bounds and attains a strong notion of instance-optimality.
arXiv Detail & Related papers (2021-12-24T17:21:04Z)
Learning with Multiclass AUC: Theory and Algorithms [141.63211412386283]
Area under the ROC curve (AUC) is a well-known ranking metric for problems such as imbalanced learning and recommender systems. In this paper, we start an early trial to consider the problem of learning multiclass scoring functions via optimizing multiclass AUC metrics.
arXiv Detail & Related papers (2021-07-28T05:18:10Z)
A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning [25.39784277231972]
We introduce a new generalized MSPBE that extends the linear MSPBE to the nonlinear setting. We derive an easy-to-use, but sound, algorithm to minimize the generalized objective.
arXiv Detail & Related papers (2021-04-28T15:50:34Z)
Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network. Our algorithm requires far fewer communication rounds in theory. Our experiments on several datasets show the effectiveness of our algorithm and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.