Truncating Trajectories in Monte Carlo Reinforcement Learning
- URL: http://arxiv.org/abs/2305.04361v1
- Date: Sun, 7 May 2023 19:41:57 GMT
- Title: Truncating Trajectories in Monte Carlo Reinforcement Learning
- Authors: Riccardo Poiani, Alberto Maria Metelli, Marcello Restelli
- Abstract summary: In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
- Score: 48.97155920826079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Reinforcement Learning (RL), an agent acts in an unknown environment to
maximize the expected cumulative discounted sum of an external reward signal,
i.e., the expected return. In practice, in many tasks of interest, such as
policy optimization, the agent usually spends its interaction budget by
collecting episodes of fixed length within a simulator (i.e., Monte Carlo
simulation). However, given the discounted nature of the RL objective, this
data collection strategy might not be the best option. Indeed, the rewards
taken in early simulation steps weigh exponentially more than future rewards.
Taking a cue from this intuition, in this paper, we design an a-priori budget
allocation strategy that leads to the collection of trajectories of different
lengths, i.e., truncated. The proposed approach provably minimizes the width of
the confidence intervals around the empirical estimates of the expected return
of a policy. After discussing the theoretical properties of our method, we make
use of our trajectory truncation mechanism to extend the Policy Optimization via
Importance Sampling (POIS, Metelli et al., 2018) algorithm. Finally, we conduct
a numerical comparison between our algorithm and POIS: the results are
consistent with our theory and show that an appropriate truncation of the
trajectories can succeed in improving performance.
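To make the intuition in the abstract concrete, the following is a minimal Python sketch, not the authors' allocation strategy. It shows how quickly the discount weights decay and how an expected-return estimate can be formed from trajectories of unequal length by averaging, at each step, only over the trajectories that reach that step. The function names, the toy reward model, and the hand-picked trajectory lengths are all assumptions made for illustration.

```python
import random


def discount_weights(gamma, horizon):
    """Weight gamma**t applied to the reward at each step t = 0..horizon-1."""
    return [gamma ** t for t in range(horizon)]


def truncated_return_estimate(trajectories, gamma):
    """Estimate the expected discounted return from reward lists of unequal length.

    At each step t, the reward is averaged only over the trajectories that
    actually reach step t, and the average is weighted by gamma**t.
    """
    horizon = max(len(rewards) for rewards in trajectories)
    estimate = 0.0
    for t in range(horizon):
        rewards_at_t = [rewards[t] for rewards in trajectories if len(rewards) > t]
        estimate += gamma ** t * sum(rewards_at_t) / len(rewards_at_t)
    return estimate


def sample_rewards(length, rng):
    """Hypothetical toy simulator: i.i.d. noisy rewards with mean 1."""
    return [1.0 + rng.gauss(0.0, 1.0) for _ in range(length)]


if __name__ == "__main__":
    gamma, horizon, budget = 0.9, 20, 200
    rng = random.Random(0)

    # Early steps dominate the objective: print the first and last weights.
    weights = discount_weights(gamma, horizon)
    print("weight at t=0:", weights[0], "weight at t=19:", weights[-1])

    # Fixed-length collection: 10 trajectories of 20 steps (200 steps total).
    fixed = [sample_rewards(horizon, rng) for _ in range(budget // horizon)]

    # Hand-picked mixed-length collection with the same 200-step budget:
    # 4 full trajectories plus 24 trajectories truncated at 5 steps.
    mixed = [sample_rewards(20, rng) for _ in range(4)]
    mixed += [sample_rewards(5, rng) for _ in range(24)]

    true_value = sum(weights)  # mean reward is 1 at every step in this toy model
    print("true value:", true_value)
    print("fixed-length estimate:", truncated_return_estimate(fixed, gamma))
    print("mixed-length estimate:", truncated_return_estimate(mixed, gamma))
```

In the mixed-length collection, the heavily weighted early steps are sampled 28 times instead of 10, at the cost of fewer samples for the lightly weighted late steps. How to choose the lengths so as to minimize the confidence-interval width is the subject of the paper and is not reproduced here.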
Related papers
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Learning Merton's Strategies in an Incomplete Market: Recursive Entropy Regularization and Biased Gaussian Exploration [11.774563966512709]
We take the reinforcement learning (RL) approach to learn optimal portfolio policies directly by exploring the unknown market.
We present an analysis of the resulting errors to show how the level of exploration affects the learned policies.
arXiv Detail & Related papers (2023-12-19T02:14:13Z)
- Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting [29.21380944341589]
We show that state-of-the-art offline RL algorithms are overly restrained by low-return trajectories and fail to exploit trajectories to the fullest.
This reweighted sampling strategy may be combined with any offline RL algorithm.
We empirically show that while CQL, IQL, and TD3+BC achieve only a part of this potential policy improvement, these same algorithms combined with the reweighted sampling strategy fully exploit the dataset.
arXiv Detail & Related papers (2023-06-22T17:58:02Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Soft policy optimization using dual-track advantage estimator [5.4020749513539235]
This paper introduces an entropy term and dynamically sets the temperature coefficient to balance exploration and exploitation.
We propose the dual-track advantage estimator (DTAE) to accelerate the convergence of value functions and further enhance the performance of the algorithm.
Compared with other on-policy RL algorithms on MuJoCo environments, the proposed method achieves the best results in cumulative return.
arXiv Detail & Related papers (2020-09-15T04:09:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.