Goal-conditioned Offline Reinforcement Learning through State Space Partitioning
- URL: http://arxiv.org/abs/2303.09367v2
- Date: Thu, 16 May 2024 17:07:44 GMT
- Title: Goal-conditioned Offline Reinforcement Learning through State Space Partitioning
- Authors: Mianchu Wang, Yue Jin, Giovanni Montana
- Abstract summary: Offline reinforcement learning (RL) aims to infer sequential decision policies using only offline datasets.
We argue that, despite its benefits, advantage-weighted supervised learning alone is still insufficient to fully address the distribution shift and multi-modality problems.
We propose a complementary advantage-based weighting scheme that introduces an additional source of inductive bias.
- Score: 9.38848713730931
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning (RL) aims to infer sequential decision policies using only offline datasets. This is a particularly difficult setup, especially when learning to achieve multiple different goals or outcomes under a given scenario with only sparse rewards. For offline learning of goal-conditioned policies via supervised learning, previous work has shown that an advantage weighted log-likelihood loss guarantees monotonic policy improvement. In this work we argue that, despite its benefits, this approach is still insufficient to fully address the distribution shift and multi-modality problems. The latter is particularly severe in long-horizon tasks where finding a unique and optimal policy that goes from a state to the desired goal is challenging as there may be multiple and potentially conflicting solutions. To tackle these challenges, we propose a complementary advantage-based weighting scheme that introduces an additional source of inductive bias: given a value-based partitioning of the state space, the contribution of actions expected to lead to target regions that are easier to reach, compared to the final goal, is further increased. Empirically, we demonstrate that the proposed approach, Dual-Advantage Weighted Offline Goal-conditioned RL (DAWOG), outperforms several competing offline algorithms in commonly used benchmarks. Analytically, we offer a guarantee that the learnt policy is never worse than the underlying behaviour policy.
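To make the dual-advantage weighting concrete, below is a minimal PyTorch-style sketch of such a weighted log-likelihood objective. The names `q_fn`, `v_fn`, and `partition_v_fn`, the definition of the partition advantage as a difference of region values, and the temperatures are illustrative assumptions, not the exact formulation used by DAWOG.

```python
import torch

def dual_advantage_weighted_loss(policy, batch, q_fn, v_fn, partition_v_fn,
                                 beta_goal=1.0, beta_part=1.0, w_max=20.0):
    """Hypothetical sketch of a dual-advantage weighted log-likelihood loss.

    policy          -- assumed to expose log_prob(action, state, goal).
    q_fn, v_fn      -- goal-conditioned action-value and state-value estimates.
    partition_v_fn  -- value assigned to the region of the state-space
                       partition that a state belongs to, given the goal.
    """
    s, a, s_next, g = batch["s"], batch["a"], batch["s_next"], batch["g"]

    # Standard goal-conditioned advantage, as in advantage-weighted regression.
    adv_goal = q_fn(s, a, g) - v_fn(s, g)

    # Second advantage defined over the value-based partition: positive when
    # the action moves towards a target region that is easier to reach than
    # the final goal, so its contribution is further increased.
    adv_part = partition_v_fn(s_next, g) - partition_v_fn(s, g)

    # Exponentiated dual-advantage weights, clipped for numerical stability.
    weights = torch.exp(beta_goal * adv_goal + beta_part * adv_part)
    weights = weights.clamp(max=w_max).detach()

    # Weighted supervised (negative log-likelihood) loss on dataset actions.
    return -(weights * policy.log_prob(a, s, g)).mean()
```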
Related papers
- Safe and Balanced: A Framework for Constrained Multi-Objective Reinforcement Learning [26.244121960815907]
We propose a primal-based framework that orchestrates policy optimization between multi-objective learning and constraint adherence.
Our method employs a novel natural policy gradient manipulation technique to optimize multiple RL objectives.
Empirically, our proposed method also outperforms prior state-of-the-art methods on challenging safe multi-objective reinforcement learning tasks.
arXiv Detail & Related papers (2024-05-26T00:42:10Z) - Policy-regularized Offline Multi-objective Reinforcement Learning [11.58560880898882]
We extend the offline policy-regularized method, a widely adopted approach for single-objective offline RL problems, to the multi-objective setting.
We propose two solutions to this problem: 1) filtering out preference-inconsistent demonstrations by approximating behavior preferences, and 2) adopting regularization techniques with high policy expressiveness.
arXiv Detail & Related papers (2024-01-04T12:54:10Z) - GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models [31.628341050846768]
Goal-conditioned Offline Planning (GOPlan) is a novel model-based framework that contains two key phases.
GOPlan pretrains a prior policy capable of capturing multi-modal action distributions within the multi-goal dataset.
The reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals.
arXiv Detail & Related papers (2023-10-30T21:19:52Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Imitating Graph-Based Planning with Goal-Conditioned Policies [72.61631088613048]
We present a self-imitation scheme which distills a subgoal-conditioned policy into the target-goal-conditioned policy.
We empirically show that our method can significantly boost the sample efficiency of existing goal-conditioned RL methods.
arXiv Detail & Related papers (2023-03-20T14:51:10Z) - Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality [57.91411772725183]
In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite.
We design simple value-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks.
Our analysis of these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal.
arXiv Detail & Related papers (2022-06-10T07:44:56Z) - Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space [76.46113138484947]
General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments.
To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach goals for a wide range of tasks on command.
We propose Planning to Practice, a method that makes it practical to train goal-conditioned policies for long-horizon tasks.
arXiv Detail & Related papers (2022-05-17T06:58:17Z) - Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z) - State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
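As an illustration of the state-augmentation idea in the last entry above, the following sketch wraps a Gymnasium-style environment so that the observation is extended with Lagrange multipliers updated by dual ascent on per-step constraint costs. The `info["costs"]` key, the dual-update rule, and the reward shaping are assumptions made for illustration, not the paper's exact construction.

```python
import numpy as np

class LagrangianStateAugmentation:
    """Illustrative wrapper: append Lagrange multipliers to the observation.

    Assumes a Gymnasium-style env whose step() returns an `info` dict exposing
    per-step constraint costs under info["costs"] (one entry per constraint).
    """

    def __init__(self, env, n_constraints, thresholds, dual_lr=1e-3):
        self.env = env
        self.thresholds = np.asarray(thresholds, dtype=np.float32)
        self.multipliers = np.zeros(n_constraints, dtype=np.float32)
        self.dual_lr = dual_lr

    def _augment(self, obs):
        # The policy conditions on the multipliers as part of the state.
        return np.concatenate([np.asarray(obs, dtype=np.float32),
                               self.multipliers])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.multipliers[:] = 0.0
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        costs = np.asarray(info.get("costs", np.zeros_like(self.multipliers)),
                           dtype=np.float32)

        # Dual ascent: multipliers grow while constraints are violated and
        # shrink (never below zero) when there is slack.
        self.multipliers = np.maximum(
            0.0, self.multipliers + self.dual_lr * (costs - self.thresholds))

        # Reward penalised by the current multipliers, which are themselves
        # visible to the policy through the augmented observation.
        shaped_reward = float(reward) - float(self.multipliers @ costs)
        return self._augment(obs), shaped_reward, terminated, truncated, info
```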