Truncating Trajectories in Monte Carlo Reinforcement Learning
- URL: http://arxiv.org/abs/2305.04361v1
- Date: Sun, 7 May 2023 19:41:57 GMT
- Title: Truncating Trajectories in Monte Carlo Reinforcement Learning
- Authors: Riccardo Poiani, Alberto Maria Metelli, Marcello Restelli
- Abstract summary: In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
- Score: 48.97155920826079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Reinforcement Learning (RL), an agent acts in an unknown environment to
maximize the expected cumulative discounted sum of an external reward signal,
i.e., the expected return. In practice, in many tasks of interest, such as
policy optimization, the agent usually spends its interaction budget by
collecting episodes of fixed length within a simulator (i.e., Monte Carlo
simulation). However, given the discounted nature of the RL objective, this
data collection strategy might not be the best option. Indeed, the rewards
taken in early simulation steps weigh exponentially more than future rewards.
Taking a cue from this intuition, in this paper, we design an a-priori budget
allocation strategy that leads to the collection of trajectories of different
lengths, i.e., truncated. The proposed approach provably minimizes the width of
the confidence intervals around the empirical estimates of the expected return
of a policy. After discussing the theoretical properties of our method, we make
use of our trajectory truncation mechanism to extend the Policy Optimization via
Importance Sampling (POIS, Metelli et al., 2018) algorithm. Finally, we conduct
a numerical comparison between our algorithm and POIS: the results are
consistent with our theory and show that an appropriate truncation of the
trajectories can succeed in improving performance.
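To make the budget-allocation idea concrete, the following Python sketch spreads a fixed budget of simulation steps over trajectories of different lengths and estimates the expected discounted return from the truncated data. The geometric schedule, the `decay` parameter, the `rollout_fn(length)` simulator interface, and the per-step averaging estimator are illustrative assumptions, not the provably optimal allocation or the exact estimator derived in the paper.

```python
import numpy as np

def truncation_schedule(budget_steps, horizon, gamma, decay=0.5):
    """Hypothetical a-priori schedule: n_alive[h] is the number of
    trajectories still being simulated at step h. More budget is spent on
    early steps, whose rewards weigh exponentially more under discounting."""
    weights = gamma ** (decay * np.arange(horizon))   # decreasing in h
    n_alive = np.floor(budget_steps * weights / weights.sum()).astype(int)
    return n_alive                                    # non-increasing in h

def estimate_return(rollout_fn, n_alive, gamma):
    """Monte Carlo estimate of the expected discounted return from truncated
    trajectories. `rollout_fn(length)` is an assumed interface that simulates
    one episode of the given length and returns its reward sequence."""
    horizon = len(n_alive)
    # n_alive[h] - n_alive[h+1] trajectories are truncated at length h+1.
    counts = n_alive - np.append(n_alive[1:], 0)
    per_step_sum = np.zeros(horizon)
    for length, count in enumerate(counts, start=1):
        for _ in range(count):
            rewards = np.asarray(rollout_fn(length), dtype=float)
            per_step_sum[:length] += rewards
    # Average the reward observed at step h over the n_alive[h] trajectories
    # that reached it, then take the discounted sum.
    per_step_mean = per_step_sum / np.maximum(n_alive, 1)
    return float((gamma ** np.arange(horizon)) @ per_step_mean)

# Toy usage: a simulator whose reward is 1 at every step.
schedule = truncation_schedule(budget_steps=10_000, horizon=100, gamma=0.99)
j_hat = estimate_return(lambda length: np.ones(length), schedule, gamma=0.99)
```

In this view, standard fixed-length Monte Carlo simulation corresponds to a constant n_alive profile, whereas the paper's a-priori allocation spends the same overall interaction budget on trajectories of different lengths so as to minimize the width of the confidence interval around the return estimate.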
Related papers
- Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
arXiv Detail & Related papers (2024-10-17T11:47:56Z)
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- How does Your RL Agent Explore? An Optimal Transport Analysis of Occupancy Measure Trajectories [8.429001045596687]
We represent the learning process of an RL algorithm as a sequence of policies generated during training.
We then study the policy trajectory induced in the manifold of state-action occupancy measures.
arXiv Detail & Related papers (2024-02-14T11:55:50Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Soft policy optimization using dual-track advantage estimator [5.4020749513539235]
This paper introduces an entropy term and dynamically sets the temperature coefficient to balance exploration and exploitation.
We propose the dual-track advantage estimator (DTAE) to accelerate the convergence of value functions and further enhance the performance of the algorithm.
Compared with other on-policy RL algorithms on the MuJoCo environments, the proposed method achieves state-of-the-art results in terms of cumulative return.
arXiv Detail & Related papers (2020-09-15T04:09:29Z)