$f$-Policy Gradients: A General Framework for Goal Conditioned RL using
$f$-Divergences
- URL: http://arxiv.org/abs/2310.06794v1
- Date: Tue, 10 Oct 2023 17:07:05 GMT
- Title: $f$-Policy Gradients: A General Framework for Goal Conditioned RL using
$f$-Divergences
- Authors: Siddhant Agarwal, Ishan Durugkar, Peter Stone, Amy Zhang
- Abstract summary: This paper introduces a novel way to encourage exploration called $f$-Policy Gradients.
We show that $f$-PG has better performance compared to standard policy gradient methods on a challenging gridworld.
- Score: 44.91973620442546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Goal-Conditioned Reinforcement Learning (RL) problems often have access to
sparse rewards where the agent receives a reward signal only when it has
achieved the goal, making policy optimization a difficult problem. Several
works augment this sparse reward with a learned dense reward function, but this
can lead to sub-optimal policies if the reward is misaligned. Moreover, recent
works have demonstrated that effective shaping rewards for a particular problem
can depend on the underlying learning algorithm. This paper introduces a novel
way to encourage exploration called $f$-Policy Gradients, or $f$-PG. $f$-PG
minimizes the $f$-divergence between the agent's state visitation distribution
and the goal, which we show can lead to an optimal policy. We derive gradients
for various $f$-divergences to optimize this objective. Our learning paradigm
provides dense learning signals for exploration in sparse reward settings. We
further introduce an entropy-regularized policy optimization objective, that we
call $state$-MaxEnt RL (or $s$-MaxEnt RL) as a special case of our objective.
We show that several metric-based shaping rewards like L2 can be used with
$s$-MaxEnt RL, providing a common ground to study such metric-based shaping
rewards with efficient exploration. We find that $f$-PG has better performance
compared to standard policy gradient methods on a challenging gridworld as well
as the Point Maze and FetchReach environments. More information on our website
https://agarwalsiddhant10.github.io/projects/fpg.html.
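To make the abstract's connection between the $f$-divergence objective, $s$-MaxEnt RL, and metric-based shaping concrete, the following is a minimal sketch using assumed notation ($p_\theta$ for the policy's state-visitation distribution, $p_g$ for a goal density); the paper's exact formulation and divergence ordering may differ. With the generator $f(u) = -\log u$:

```latex
% Hedged sketch, not the paper's derivation: with f(u) = -log u, the
% f-divergence objective reduces to a reverse KL, whose minimization is
% equivalent to maximizing state entropy plus an expected log goal-density
% (shaping) term.
\begin{align*}
  D_f\!\left(p_g \,\|\, p_\theta\right)
    &= \mathbb{E}_{s \sim p_\theta}\!\left[-\log\tfrac{p_g(s)}{p_\theta(s)}\right]
     = D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_g\right), \\
  \min_{\theta}\, D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_g\right)
    &\;\Longleftrightarrow\;
    \max_{\theta}\; \mathcal{H}\!\left(p_\theta\right)
      + \mathbb{E}_{s \sim p_\theta}\!\left[\log p_g(s)\right].
\end{align*}
```

Under this reading, choosing $p_g(s) \propto \exp(-\lVert s - g\rVert_2)$ turns the second term into an L2 shaping reward while the entropy term rewards broad state coverage, which is one way to see how metric-based shaping rewards fit under $s$-MaxEnt RL.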
Related papers
- Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called GFA-RFE.
The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment.
Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.
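The summary does not spell out the uncertainty estimate; as a generic, hypothetical illustration of an uncertainty-aware intrinsic reward (not necessarily GFA-RFE's actual construction), one can reward transitions on which an ensemble of value predictors disagrees:

```python
import numpy as np

# Generic, hypothetical sketch of an uncertainty-aware intrinsic reward
# (illustrative only; not GFA-RFE's estimator): bonus proportional to the
# disagreement of an ensemble of value predictions on each transition.

def intrinsic_reward(ensemble_preds: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """ensemble_preds: (n_models, batch) value estimates for a batch of
    state-action pairs; returns a bonus proportional to the ensemble std."""
    return beta * ensemble_preds.std(axis=0)

# Example: three toy value models scoring four transitions; the bonus is
# largest where the models disagree most (the second column).
preds = np.array([[0.2, 1.0, 0.5, 0.1],
                  [0.3, 0.2, 0.5, 0.1],
                  [0.1, 1.8, 0.5, 0.1]])
print(intrinsic_reward(preds))
```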
arXiv Detail & Related papers (2024-06-24T01:37:18Z)
- To the Max: Reinventing Reward in Reinforcement Learning [1.5498250598583487]
In reinforcement learning (RL), different reward functions can define the same optimal policy but result in drastically different learning performance.
We introduce max-reward RL, where an agent optimizes the maximum rather than the cumulative reward.
In experiments, we study the performance of max-reward RL algorithms in two goal-reaching environments from Gymnasium-Robotics.
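For intuition, the contrast between the standard cumulative objective and the max-reward objective can be written schematically as follows (notation assumed here; the paper's precise definition, e.g. how discounting enters the max, may differ):

```latex
% Schematic contrast (assumed notation): standard RL maximizes an expected
% cumulative return, max-reward RL the expected largest reward encountered.
\begin{align*}
  J_{\mathrm{sum}}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right],
  \qquad
  J_{\max}(\pi) = \mathbb{E}_{\pi}\!\left[\max_{0 \le t \le T} r_t\right].
\end{align*}
```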
arXiv Detail & Related papers (2024-02-02T12:29:18Z)
- Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning [26.067411894141863]
An appropriate reward function is of paramount importance in specifying a task in reinforcement learning (RL).
Human-in-the-loop (HiL) RL allows humans to communicate complex goals to the RL agent by providing various types of feedback.
We provide an active-learning-based RL algorithm that first explores the environment without specifying a reward function.
arXiv Detail & Related papers (2023-04-18T12:36:09Z)
- Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning [17.239062061431646]
This paper studies reward-agnostic exploration in reinforcement learning (RL).
Consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and a horizon length $H$.
Our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions.
arXiv Detail & Related papers (2023-04-14T17:46:49Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
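As a rough sketch of the redistribution step only (a hypothetical linear-feature variant; PARTED's actual estimator and its pessimism construction are more involved), per-step proxy rewards can be fit by least squares so that they sum to each trajectory's return:

```python
import numpy as np

# Hypothetical sketch of least-squares reward redistribution (illustrative
# only; not PARTED's exact estimator): fit a linear proxy reward whose
# per-step values sum to the observed trajectory return.

def fit_proxy_reward(trajectories, ridge=1e-3):
    """trajectories: list of (Phi, R), where Phi is a (T, d) array of per-step
    features and R is the scalar return of that trajectory."""
    X = np.stack([Phi.sum(axis=0) for Phi, _ in trajectories])  # (N, d) summed features
    y = np.array([R for _, R in trajectories])                  # (N,) trajectory returns
    d = X.shape[1]
    # Ridge-regularized least squares: w = (X^T X + ridge * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

def per_step_proxy_rewards(Phi, w):
    """Redistribute a trajectory's return into per-step proxy rewards."""
    return Phi @ w
```

A pessimistic value-iteration step would then, roughly, treat these learned proxy rewards (less an uncertainty penalty) as if they were per-step rewards.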
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment for at most $O\left(\frac{S^2A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes of exploration.
arXiv Detail & Related papers (2020-10-12T17:51:19Z)
- Adaptive Reward-Free Exploration [48.98199700043158]
We show that our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994.
We further investigate the relative complexities of reward-free exploration and best-policy identification.
arXiv Detail & Related papers (2020-06-11T09:58:03Z)
- Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.