Quasimetric Value Functions with Dense Rewards
- URL: http://arxiv.org/abs/2409.08724v1
- Date: Fri, 13 Sep 2024 11:26:05 GMT
- Title: Quasimetric Value Functions with Dense Rewards
- Authors: Khadichabonu Valieva, Bikramjit Banerjee
- Abstract summary: We show that the key property underpinning a quasimetric, viz., the triangle inequality, is preserved under a dense reward setting.
Dense reward functions that satisfy this condition can only improve, never worsen, sample complexity.
This opens up opportunities to train efficient neural architectures with dense rewards, compounding their benefits to sample complexity.
- Score: 1.6574413179773761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a generalization of reinforcement learning (RL) to parametrizable goals, goal conditioned RL (GCRL) has a broad range of applications, particularly in challenging tasks in robotics. Recent work has established that the optimal value function of GCRL $Q^\ast(s,a,g)$ has a quasimetric structure, leading to targeted neural architectures that respect such structure. However, the relevant analyses assume a sparse reward setting -- a known aggravating factor to sample complexity. We show that the key property underpinning a quasimetric, viz., the triangle inequality, is preserved under a dense reward setting as well. Contrary to earlier findings where dense rewards were shown to be detrimental to GCRL, we identify the key condition necessary for the triangle inequality. Dense reward functions that satisfy this condition can only improve, never worsen, sample complexity. This opens up opportunities to train efficient neural architectures with dense rewards, compounding their benefits to sample complexity. We evaluate this proposal in 12 standard benchmark environments in GCRL featuring challenging continuous control tasks. Our empirical results confirm that training a quasimetric value function in our dense reward setting indeed outperforms training with sparse rewards.
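As a concrete illustration of the structure the abstract refers to, the sketch below parameterizes Q(s, a, g) as the negative of an asymmetric distance between an (s, a) embedding and a goal embedding. This is a minimal PyTorch sketch under stated assumptions, not the authors' architecture: the class name `QuasimetricQ`, the encoder sizes, and the specific distance d(x, y) = sum_i relu(x_i - y_i) are illustrative choices; that distance does satisfy d(x, x) = 0 and the triangle inequality, which is the property the paper argues carries over to suitable dense rewards.

```python
# Minimal sketch (assumption: not the paper's exact architecture) of a
# quasimetric value function Q(s, a, g) = -d(phi(s, a), psi(g)), where
# d(x, y) = sum_i relu(x_i - y_i) is asymmetric, satisfies d(x, x) = 0,
# and obeys the triangle inequality d(x, z) <= d(x, y) + d(y, z).
import torch
import torch.nn as nn


class QuasimetricQ(nn.Module):  # hypothetical class name
    def __init__(self, state_dim, action_dim, goal_dim, embed_dim=64):
        super().__init__()
        self.sa_encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )
        self.g_encoder = nn.Sequential(
            nn.Linear(goal_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    @staticmethod
    def quasimetric(x, y):
        # Componentwise relu(x - y), summed: asymmetric in (x, y) but
        # triangle-inequality-consistent because relu is subadditive.
        return torch.relu(x - y).sum(dim=-1)

    def forward(self, state, action, goal):
        x = self.sa_encoder(torch.cat([state, action], dim=-1))
        y = self.g_encoder(goal)
        # With a -1-per-step sparse reward, -Q* is the (quasimetric)
        # cost-to-go from (s, a) to g; the paper's condition on dense
        # rewards is what keeps this structure intact.
        return -self.quasimetric(x, y)


if __name__ == "__main__":
    q = QuasimetricQ(state_dim=8, action_dim=2, goal_dim=3)
    s, a, g = torch.randn(5, 8), torch.randn(5, 2), torch.randn(5, 3)
    print(q(s, a, g).shape)  # torch.Size([5])
```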
Related papers
- OGBench: Benchmarking Offline Goal-Conditioned RL [72.00291801676684]
Offline goal-conditioned reinforcement learning (GCRL) is a major problem in reinforcement learning.
We propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL.
arXiv Detail & Related papers (2024-10-26T06:06:08Z)
- Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called GFA-RFE.
The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment.
Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.
arXiv Detail & Related papers (2024-06-24T01:37:18Z)
- The Effective Horizon Explains Deep RL Performance in Stochastic Environments [21.148001945560075]
Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds.
We introduce a new RL algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring randomly to collect rollouts.
We leverage SQIRL to derive instance-dependent sample complexity bounds for RL that are exponential only in an "effective horizon" look-ahead and in the complexity of the class used for function approximation.
arXiv Detail & Related papers (2023-12-13T18:58:56Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows that, by exploiting certain structures, one can ease the reward design process.
We propose a hierarchical reward modeling framework, HERON, for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning.
arXiv Detail & Related papers (2023-09-06T00:44:29Z)
- Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning [26.067411894141863]
An appropriate reward function is of paramount importance in specifying a task in reinforcement learning (RL).
Human-in-the-loop (HiL) RL allows humans to communicate complex goals to the RL agent by providing various types of feedback.
We provide an active-learning-based RL algorithm that first explores the environment without specifying a reward function.
arXiv Detail & Related papers (2023-02-23T22:07:33Z)
- Reward Learning as Doubly Nonparametric Bandits: Optimal Design and Scaling Laws [22.099915149343957]
We propose a theoretical framework for studying reward learning and the associated optimal experiment design problem.
We first derive non-asymptotic excess risk bounds for a simple plug-in estimator based on ridge regression.
We then solve the query design problem by optimizing these risk bounds with respect to the choice of query set and obtain a finite sample statistical rate.
arXiv Detail & Related papers (2022-08-17T08:04:41Z)
- Metric Residual Networks for Sample Efficient Goal-conditioned Reinforcement Learning [52.59242013527014]
Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications.
Sample efficiency is of utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal.
We introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward (a sketch of this redistribution step appears after this list).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
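The least-squares reward redistribution mentioned in the PARTED entry above can be sketched in a few lines. The version below is a simplified illustration under assumed linear per-step features phi(s_t, a_t) and a ridge regularizer (the function name `redistribute_rewards` and the regularization constant are hypothetical, not taken from the paper): it fits weights theta so that the per-step proxy rewards phi_t @ theta sum to each trajectory's observed return.

```python
# Minimal sketch (assumptions: linear per-step features and a ridge
# regularizer; not PARTED's full algorithm) of least-squares reward
# redistribution: fit theta so that sum_t phi_t @ theta matches each
# trajectory's total return, then use r_hat_t = phi_t @ theta as a
# per-step proxy reward.
import numpy as np


def redistribute_rewards(trajectories, returns, lam=1e-3):
    """trajectories: list of (T_i, d) feature arrays; returns: (N,) trajectory totals."""
    # Each trajectory contributes one linear equation: (sum_t phi_t) @ theta = R_i.
    X = np.stack([traj.sum(axis=0) for traj in trajectories])    # (N, d)
    R = np.asarray(returns, dtype=float)                         # (N,)
    d = X.shape[1]
    theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ R)  # ridge solution
    # Per-step proxy rewards for each trajectory.
    return [traj @ theta for traj in trajectories]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_theta = rng.normal(size=4)
    trajs = [rng.normal(size=(rng.integers(5, 15), 4)) for _ in range(50)]
    rets = [traj.sum(axis=0) @ true_theta for traj in trajs]
    proxies = redistribute_rewards(trajs, rets)
    # Proxy rewards of a trajectory should sum (approximately) to its return.
    print(np.allclose(sum(proxies[0]), rets[0], atol=1e-2))
```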
This list is automatically generated from the titles and abstracts of the papers on this site.