Related papers: Adaptive Exploration for Data-Efficient General Value Function Evaluations

URL: http://arxiv.org/abs/2405.07838v2
Date: Sun, 13 Oct 2024 15:54:10 GMT
Title: Adaptive Exploration for Data-Efficient General Value Function Evaluations
Authors: Arushi Jain, Josiah P. Hanna, Doina Precup
Abstract summary: General Value Functions (GVFs) represent predictive knowledge in reinforcement learning. GVFExplorer learns a single behavior policy that efficiently collects data for evaluating multiple GVFs in parallel.
Score: 40.156127789708265
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: General Value Functions (GVFs) (Sutton et al., 2011) represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique reward. Existing methods relying on fixed behavior policies or pre-collected data often face data efficiency issues when learning multiple GVFs in parallel using off-policy methods. To address this, we introduce GVFExplorer, which adaptively learns a single behavior policy that efficiently collects data for evaluating multiple GVFs in parallel. Our method optimizes the behavior policy by minimizing the total variance in return across GVFs, thereby reducing the required environmental interactions. We use an existing temporal-difference-style variance estimator to approximate the return variance. We prove that each behavior policy update decreases the overall mean squared error in GVF predictions. We empirically show our method's performance in tabular and nonlinear function approximation settings, including MuJoCo environments, with stationary and non-stationary reward signals, optimizing data usage and reducing prediction errors across multiple GVFs.
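The abstract mentions a temporal-difference-style variance estimator but does not give its form. As a rough illustration only, the Python sketch below shows one common on-policy variant from the literature, in which the return variance is learned with a TD update whose target is the squared TD error plus the discounted variance of the next state-action pair. This is an assumption about the general technique, not the estimator or code used in the paper; the function name `td_variance_update`, the tabular dictionaries `q` and `m`, and the toy one-step environment are all hypothetical.

```python
import random
from collections import defaultdict


def td_variance_update(q, m, s, a, r, s_next, a_next, done,
                       gamma=0.9, alpha=0.01):
    """One on-policy TD update of the value table q and variance table m.

    Hypothetical helper: q[(s, a)] estimates the expected return and
    m[(s, a)] estimates the variance of the return.
    """
    q_next = 0.0 if done else q[(s_next, a_next)]
    m_next = 0.0 if done else m[(s_next, a_next)]

    # Ordinary TD error for the value estimate.
    delta = r + gamma * q_next - q[(s, a)]
    q[(s, a)] += alpha * delta

    # Variance target: squared TD error plus gamma^2 times next variance.
    m_target = delta ** 2 + gamma ** 2 * m_next
    m[(s, a)] += alpha * (m_target - m[(s, a)])
    return delta


# Toy check (assumed setup): one-step episodes with reward +1 or -1,
# each with probability 0.5, so the true return variance is 1.
rng = random.Random(0)
q = defaultdict(float)
m = defaultdict(float)
for _ in range(5000):
    reward = rng.choice([-1.0, 1.0])
    td_variance_update(q, m, s=0, a=0, r=reward,
                       s_next=None, a_next=None, done=True)

print(f"value estimate: {q[(0, 0)]:.3f}, variance estimate: {m[(0, 0)]:.3f}")
```

In a multi-GVF setting, one such variance table could be kept per GVF and the behavior policy could favor actions where the combined variance is high; how GVFExplorer actually combines them is specified in the paper, not here.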

Related papers

Neighbour-Driven Gaussian Process Variational Autoencoders for Scalable Structured Latent Modelling [14.358070928996069]
Gaussian Process (GP) Variational Autoencoders (VAEs) extend standard VAEs by replacing the fully factorised Gaussian prior with a GP prior. Performing exact GP inference in large-scale GPVAEs is computationally prohibitive, often forcing existing approaches to rely on restrictive kernel assumptions. We propose a neighbour-driven approximation strategy that exploits local adjacencies in the latent space to achieve scalable GPVAE inference.
arXiv Detail & Related papers (2025-05-22T10:07:33Z)
Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data [3.6714630660726586]
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes. We propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes.
arXiv Detail & Related papers (2025-05-14T15:44:10Z)
Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples. However, IS is employed in RL as a passive tool for re-weighting historical samples. We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization [90.9780151608281]
In-sample learning (IQL) improves the policy by quantile regression using only data samples. We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework. We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
arXiv Detail & Related papers (2023-03-28T08:30:01Z)
Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration [39.250754806600135]
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy. Conventional methods for off-policy PG estimation often suffer from significant bias or exponentially large variance. In this paper, we propose the double Fitted PG estimation (FPG) algorithm.
arXiv Detail & Related papers (2022-01-31T20:23:52Z)
Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning. The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs. To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
A Unified Off-Policy Evaluation Approach for General Value Function [131.45028999325797]
The General Value Function (GVF) is a powerful tool to represent both predictive and retrospective knowledge in reinforcement learning (RL). In this paper, we propose a new algorithm called GenTD for off-policy GVF evaluation. We show that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function.
arXiv Detail & Related papers (2021-07-06T16:20:34Z)
Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation. We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.