Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach
- URL: http://arxiv.org/abs/2410.13463v1
- Date: Thu, 17 Oct 2024 11:47:56 GMT
- Title: Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach
- Authors: Riccardo Poiani, Nicole Nobili, Alberto Maria Metelli, Marcello Restelli
- Abstract summary: Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
- Score: 51.76826149868971
- Abstract: Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms (e.g., policy gradient methods). In this context, the designer of the learning system specifies an interaction budget that the agent usually spends by collecting trajectories of fixed length within a simulator. However, is this data collection strategy the best option? To answer this question, in this paper, we propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths, i.e., *truncated*. Specifically, this surrogate shows the sub-optimality of the fixed-length trajectory schedule. Furthermore, it suggests that adaptive data collection strategies that spend the available budget sequentially can allocate a larger portion of transitions in timesteps in which more accurate sampling is required to reduce the error of the final estimate. Building on these findings, we present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO). The main intuition behind RIDO is to split the available interaction budget into mini-batches. At each round, the agent determines the most convenient schedule of trajectories that minimizes an empirical and robust version of the surrogate of the estimator's error. After discussing the theoretical properties of our method, we conclude by assessing its performance across multiple domains. Our results show that RIDO can adapt its trajectory schedule toward timesteps where more sampling is required to increase the quality of the final estimation.
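To make the mini-batch loop concrete, here is a minimal Python sketch of a RIDO-style procedure. It is an illustration consistent with the abstract, not the authors' exact algorithm: the error surrogate (discounted per-timestep standard deviation plus a 1/sqrt(n) robustness bonus), the rounding rule, and the `sample_reward` stand-in for running the policy are all assumptions.

```python
import numpy as np

def rido_sketch(sample_reward, H, budget, batch, gamma=0.99):
    """RIDO-style loop: split the budget into mini-batches and, each round,
    re-allocate transitions toward timesteps whose empirical error
    surrogate is largest. `sample_reward(t)` stands in for running the
    target policy and observing the reward at timestep t."""
    samples = [[] for _ in range(H)]   # per-timestep reward samples
    spent = 0
    while spent < budget:
        stds = np.array([np.std(s) if len(s) > 1 else 1.0 for s in samples])
        bonus = 1.0 / np.sqrt(np.array([len(s) + 1 for s in samples]))
        weight = gamma ** np.arange(H) * stds + bonus
        target = weight / weight.sum()
        # A trajectory of length h covers timesteps 0..h-1, so per-timestep
        # counts must be nonincreasing in t for the schedule to be realizable
        # as truncated trajectories; a reverse cumulative max enforces this.
        alloc = np.maximum.accumulate(target[::-1])[::-1]
        alloc = np.maximum(1, np.round(alloc / alloc.sum() * batch)).astype(int)
        for t in range(H):
            for _ in range(alloc[t]):
                samples[t].append(sample_reward(t))
        spent += int(alloc.sum())
    # Truncated-trajectory return estimate: sum of discounted per-step means.
    return sum(gamma ** t * np.mean(samples[t]) for t in range(H))

# Toy usage: rewards are much noisier early in the episode, so the schedule
# should concentrate transitions on the first few timesteps.
est = rido_sketch(lambda t: np.random.normal(1.0, 2.0 if t < 5 else 0.1),
                  H=20, budget=5000, batch=500)
```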
Related papers
- Cost-Aware Query Policies in Active Learning for Efficient Autonomous Robotic Exploration [0.0]
This paper analyzes an active learning (AL) algorithm for Gaussian Process regression while incorporating action cost.
A traditional uncertainty metric with a distance constraint best minimizes root-mean-square error over trajectory distance.
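A hedged sketch of such a distance-constrained uncertainty criterion: among candidate locations within a travel budget of the robot's position, query the one with the largest GP posterior variance. The RBF kernel, length scale, and hard-constraint form are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def rbf(a, b, ell=0.5):
    """Squared-exponential kernel between two point sets (n,d) and (m,d)."""
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d / ell ** 2)

def next_query(X_train, candidates, pos, max_dist=1.0, noise=1e-3):
    """Pick the candidate maximizing GP posterior variance, subject to a
    distance constraint that models the cost of moving the robot."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf(candidates, X_train)
    # Posterior variance: k(x,x) - k(x,X) K^-1 k(X,x), with prior var 1.
    var = 1.0 - np.einsum('ij,jk,ik->i', Ks, np.linalg.inv(K), Ks)
    near = np.linalg.norm(candidates - pos, axis=1) <= max_dist
    var[~near] = -np.inf                # enforce the travel constraint
    return candidates[np.argmax(var)]
```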
arXiv Detail & Related papers (2024-10-31T18:35:03Z)
- FLOPS: Forward Learning with OPtimal Sampling [1.694989793927645]
Gradient computation methods that rely only on forward passes, also referred to as queries, have recently gained attention.
Conventional forward learning consumes an enormous number of queries on each data point for accurate gradient estimation through Monte Carlo sampling.
We propose to allocate the optimal number of queries over each data point in a batch during training to achieve a good balance between estimation accuracy and computational efficiency.
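One classical instance of this idea (an illustration, not necessarily the paper's exact rule): with a fixed query budget per batch, Neyman allocation, i.e. queries proportional to each sample's estimated gradient-noise standard deviation, minimizes the summed variance of the Monte Carlo gradient estimates. Here `noise_std` is assumed given; in practice it would be estimated online.

```python
import numpy as np

def allocate_queries(noise_std, total_queries, min_q=1):
    """Neyman allocation: n_i proportional to sigma_i minimizes
    sum_i sigma_i**2 / n_i subject to sum_i n_i = total_queries.
    Flooring makes the split approximate but keeps it integer."""
    q = noise_std / noise_std.sum() * total_queries
    return np.maximum(min_q, np.floor(q)).astype(int)

# Hard examples (large gradient noise) receive more forward-pass queries:
print(allocate_queries(np.array([0.1, 0.4, 1.5]), total_queries=100))
```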
arXiv Detail & Related papers (2024-10-08T12:16:12Z)
- Non-ergodicity in reinforcement learning: robustness via ergodicity transformations [8.44491527275706]
Application areas for reinforcement learning (RL) include autonomous driving, precision agriculture, and finance.
We argue that a fundamental issue contributing to the lack of robustness of RL algorithms in these domains lies in their focus on the expected value of the return.
We propose an algorithm for learning ergodicity from data and demonstrate its effectiveness in an instructive, non-ergodic environment.
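The classic multiplicative-wealth example (used here as an illustration, not the paper's environment) shows why the expected return can mislead: the ensemble mean grows while almost every individual trajectory decays, and a log transformation restores ergodicity.

```python
import numpy as np

# Wealth is multiplied by 1.5 or 0.6 with equal probability each step.
rng = np.random.default_rng(0)
factors = rng.choice([1.5, 0.6], size=(100_000, 20))
wealth = factors.prod(axis=1)

print(f"ensemble mean : {wealth.mean():.2f}")     # ~1.05**20 = 2.65, grows
print(f"typical path  : {np.median(wealth):.2f}") # ~0.35, decays
# After a log (ergodicity) transformation, the time average of the
# increments matches their ensemble average: E[log f] = (ln 1.5 + ln 0.6)/2,
# which is negative and correctly signals that almost every path shrinks.
print(f"mean log-increment: {np.log(factors).mean():+.3f}")  # ~ -0.053
```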
arXiv Detail & Related papers (2023-10-17T15:13:33Z)
- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
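The paper's learned one-step transformer does not fit in a short snippet; as a crude hand-crafted stand-in for the same idea, the sketch below maintains a per-point sampling weight updated from the residuals seen so far, so likely inliers are drawn more often in later iterations. The `exp(-r / mean)` update and the two-point line model are illustrative assumptions.

```python
import numpy as np

def weighted_ransac_line(pts, iters=100, thresh=0.05, seed=0):
    """Toy residual-adaptive RANSAC for 2D line fitting on pts of shape (N, 2)."""
    rng = np.random.default_rng(seed)
    w = np.ones(len(pts)) / len(pts)            # per-point estimation state
    best, best_inl = None, -1
    for _ in range(iters):
        i, j = rng.choice(len(pts), size=2, replace=False, p=w)
        (x1, y1), (x2, y2) = pts[i], pts[j]
        a, b, c = y2 - y1, x1 - x2, x2 * y1 - x1 * y2   # line ax + by + c = 0
        r = np.abs(a * pts[:, 0] + b * pts[:, 1] + c) / np.hypot(a, b)
        inl = (r < thresh).sum()
        if inl > best_inl:
            best, best_inl = (a, b, c), inl
        w = w * np.exp(-r / r.mean())           # favour points with small residuals
        w /= w.sum()
    return best, best_inl
```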
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data [28.445166861907495]
We develop theory for the TMIS Offline Policy Evaluation (OPE) estimator.
We derive high-probability, instance-dependent bounds on its estimation error.
We also recover minimax-optimal offline learning in the adaptive setting.
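For context, a sketch of the tabular model-based flavour of OPE that such analyses study: fit empirical transition and reward tables from the logged data, then evaluate the target policy by backward induction. The self-loop default for unvisited state-action pairs is one simple convention among several, assumed here for illustration.

```python
import numpy as np

def tabular_model_ope(transitions, pi, S, A, H):
    """Model-based OPE from logged tuples (h, s, a, r, s2) with target
    policy pi of shape (H, S, A); returns estimated values per start state."""
    cnt = np.zeros((H, S, A, S))
    rew = np.zeros((H, S, A))
    for h, s, a, r, s2 in transitions:
        cnt[h, s, a, s2] += 1
        rew[h, s, a] += r
    n = cnt.sum(-1)
    # Empirical model; unvisited (s, a) get reward 0 and a self-loop.
    P = np.where(n[..., None] > 0, cnt / np.maximum(n[..., None], 1),
                 np.eye(S)[None, :, None, :])
    R = rew / np.maximum(n, 1)
    V = np.zeros(S)
    for h in reversed(range(H)):                # backward induction
        Q = R[h] + P[h] @ V                     # shape (S, A)
        V = (pi[h] * Q).sum(-1)
    return V
```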
arXiv Detail & Related papers (2023-06-24T21:48:28Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
Rather than relying on approximations of the posterior, we directly sample the Q function from its posterior distribution by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
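The core mechanism is the unadjusted Langevin update: noisy gradient descent on the loss, whose iterates approximately sample from a posterior proportional to exp(-loss). Acting greedily with respect to the sampled Q gives Thompson-sampling-style exploration. Step count and step size below are illustrative.

```python
import numpy as np

def langevin_sample(grad_loss, theta, n_steps=50, lr=1e-3, seed=0):
    """Unadjusted Langevin dynamics on the Q-function parameters:
    theta <- theta - lr * grad(loss) + sqrt(2 * lr) * N(0, I)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - lr * grad_loss(theta) + np.sqrt(2 * lr) * noise
    return theta   # an approximate posterior sample of the parameters
```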
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
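One simple a-priori allocation in this spirit (an illustration consistent with the idea, not necessarily the paper's exact formula): make the number of transitions collected at timestep t proportional to gamma**t, since later rewards are down-weighted in the discounted return. Because the counts are nonincreasing in t, they translate directly into a set of truncated trajectories.

```python
import numpy as np

def apriori_schedule(budget, H, gamma):
    """Returns m[t] = transitions collected at timestep t, and n[h] = the
    number of trajectories truncated at length h+1 that realizes m."""
    m = gamma ** np.arange(H)
    m = np.floor(m / m.sum() * budget).astype(int)
    n = m - np.append(m[1:], 0)   # trajectories covering t = h but not h+1
    return m, n

m, n = apriori_schedule(budget=10_000, H=10, gamma=0.8)
```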
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
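Spelled out in the standard linear-MDP notation, with d the feature dimension, H the horizon, and K the number of episodes, the claimed bound reads:

```latex
\mathrm{Regret}(K) \le \tilde{O}\!\left(d\sqrt{H^{3}K}\right)
```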
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
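The standard generative model in PbRL analyses is Bradley-Terry style feedback, assumed here for illustration: the overseer prefers one trajectory over another with probability given by a sigmoid of the return difference. Learning then reduces to fitting a reward model that explains the observed binary comparisons, while an optimistic planner explores where that model is uncertain.

```python
import numpy as np

def preference_prob(ret1, ret2):
    """Bradley-Terry preference model: P(traj1 preferred over traj2)
    = sigmoid(return1 - return2)."""
    return 1.0 / (1.0 + np.exp(-(ret1 - ret2)))

print(preference_prob(1.2, 0.7))   # ~0.62: mild preference for trajectory 1
```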
arXiv Detail & Related papers (2022-05-23T09:03:24Z)