Truncating Trajectories in Monte Carlo Reinforcement Learning
- URL: http://arxiv.org/abs/2305.04361v1
- Date: Sun, 7 May 2023 19:41:57 GMT
- Title: Truncating Trajectories in Monte Carlo Reinforcement Learning
- Authors: Riccardo Poiani, Alberto Maria Metelli, Marcello Restelli
- Abstract summary: In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
- Score: 48.97155920826079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Reinforcement Learning (RL), an agent acts in an unknown environment to
maximize the expected cumulative discounted sum of an external reward signal,
i.e., the expected return. In practice, in many tasks of interest, such as
policy optimization, the agent usually spends its interaction budget by
collecting episodes of fixed length within a simulator (i.e., Monte Carlo
simulation). However, given the discounted nature of the RL objective, this
data collection strategy might not be the best option. Indeed, the rewards
taken in early simulation steps weigh exponentially more than future rewards.
Taking a cue from this intuition, in this paper, we design an a-priori budget
allocation strategy that leads to the collection of trajectories of different
lengths, i.e., truncated. The proposed approach provably minimizes the width of
the confidence intervals around the empirical estimates of the expected return
of a policy. After discussing the theoretical properties of our method, we make
use of our trajectory truncation mechanism to extend the Policy Optimization via
Importance Sampling (POIS, Metelli et al., 2018) algorithm. Finally, we conduct
a numerical comparison between our algorithm and POIS: the results are
consistent with our theory and show that an appropriate truncation of the
trajectories can succeed in improving performance.
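To make the budget-allocation idea concrete, the following Python sketch spreads a fixed budget of simulation steps over trajectories of different lengths and estimates the expected discounted return from the truncated data. The geometric schedule, the `decay` parameter, the `rollout_fn(length)` simulator interface, and the per-step averaging estimator are illustrative assumptions, not the provably optimal allocation or the exact estimator derived in the paper.

```python
import numpy as np

def truncation_schedule(budget_steps, horizon, gamma, decay=0.5):
    """Hypothetical a-priori schedule: n_alive[h] is the number of
    trajectories still being simulated at step h. More budget is spent on
    early steps, whose rewards weigh exponentially more under discounting."""
    weights = gamma ** (decay * np.arange(horizon))   # decreasing in h
    n_alive = np.floor(budget_steps * weights / weights.sum()).astype(int)
    return n_alive                                    # non-increasing in h

def estimate_return(rollout_fn, n_alive, gamma):
    """Monte Carlo estimate of the expected discounted return from truncated
    trajectories. `rollout_fn(length)` is an assumed interface that simulates
    one episode of the given length and returns its reward sequence."""
    horizon = len(n_alive)
    # n_alive[h] - n_alive[h+1] trajectories are truncated at length h+1.
    counts = n_alive - np.append(n_alive[1:], 0)
    per_step_sum = np.zeros(horizon)
    for length, count in enumerate(counts, start=1):
        for _ in range(count):
            rewards = np.asarray(rollout_fn(length), dtype=float)
            per_step_sum[:length] += rewards
    # Average the reward observed at step h over the n_alive[h] trajectories
    # that reached it, then take the discounted sum.
    per_step_mean = per_step_sum / np.maximum(n_alive, 1)
    return float((gamma ** np.arange(horizon)) @ per_step_mean)

# Toy usage: a simulator whose reward is 1 at every step.
schedule = truncation_schedule(budget_steps=10_000, horizon=100, gamma=0.99)
j_hat = estimate_return(lambda length: np.ones(length), schedule, gamma=0.99)
```

In this view, standard fixed-length Monte Carlo simulation corresponds to a constant n_alive profile, whereas the paper's a-priori allocation spends the same overall interaction budget on trajectories of different lengths so as to minimize the width of the confidence interval around the return estimate.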
Related papers
- Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
arXiv Detail & Related papers (2024-10-17T11:47:56Z)
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- How does Your RL Agent Explore? An Optimal Transport Analysis of Occupancy Measure Trajectories [8.429001045596687]
We represent the learning process of an RL algorithm as a sequence of policies generated during training.
We then study the policy trajectory induced in the manifold of state-action occupancy measures.
arXiv Detail & Related papers (2024-02-14T11:55:50Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Soft policy optimization using dual-track advantage estimator [5.4020749513539235]
This paper introduces an entropy term and dynamically sets the temperature coefficient to balance exploration and exploitation.
We propose the dual-track advantage estimator (DTAE) to accelerate the convergence of value functions and further enhance the performance of the algorithm.
Compared with other on-policy RL algorithms on the MuJoCo environments, the proposed method achieves state-of-the-art results in terms of cumulative return.
arXiv Detail & Related papers (2020-09-15T04:09:29Z)