Off-Belief Learning
- URL: http://arxiv.org/abs/2103.04000v1
- Date: Sat, 6 Mar 2021 01:09:55 GMT
- Title: Off-Belief Learning
- Authors: Hengyuan Hu, Adam Lerer, Brandon Cui, Luis Pineda, David Wu, Noam
Brown, Jakob Foerster
- Abstract summary: We present off-belief learning (OBL) to learn optimal policies that are fully grounded.
OBL converges to a unique policy, making it more suitable for zero-shot coordination.
OBL shows strong performance in both a simple toy-setting and the benchmark human-AI/zero-shot coordination problem Hanabi.
- Score: 21.98027225621791
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The standard problem setting in Dec-POMDPs is self-play, where the goal is to
find a set of policies that play optimally together. Policies learned through
self-play may adopt arbitrary conventions and rely on multi-step counterfactual
reasoning based on assumptions about other agents' actions and thus fail when
paired with humans or independently trained agents. In contrast, no current
methods can learn optimal policies that are fully grounded, i.e., do not rely
on counterfactual information from observing other agents' actions. To address
this, we present off-belief learning (OBL): at each time step OBL agents
assume that all past actions were taken by a given, fixed policy ($\pi_0$), but
that future actions will be taken by an optimal policy under these same
assumptions. When $\pi_0$ is uniform random, OBL learns the optimal grounded
policy. OBL can be iterated in a hierarchy, where the optimal policy from one
level becomes the input to the next. This introduces counterfactual reasoning
in a controlled manner. Unlike independent RL which may converge to any
equilibrium policy, OBL converges to a unique policy, making it more suitable
for zero-shot coordination. OBL can be scaled to high-dimensional settings with
a fictitious transition mechanism and shows strong performance in both a simple
toy-setting and the benchmark human-AI/zero-shot coordination problem Hanabi.
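
The core mechanism above, computing learning targets as if all past actions had been generated by $\pi_0$ while acting optimally going forward, can be illustrated with a minimal sketch of a fictitious-transition target. All names below (`belief_pi0`, `env_step`, `q_next`) are hypothetical stand-ins under stated assumptions, not the paper's implementation.

```python
# Hedged sketch of an OBL-style bootstrap target via fictitious transitions.
import numpy as np

rng = np.random.default_rng(0)

def obl_target(obs_history, action, belief_pi0, env_step, q_next,
               gamma=0.99, n_samples=16):
    """One-step OBL-style target (illustrative only).

    belief_pi0(obs_history): samples a full state consistent with the agent's
        observations, assuming all past actions came from pi_0.
    env_step(state, action): simulates (reward, next_obs_history, done) from
        that resampled state -- the "fictitious transition".
    q_next(next_obs_history): the learner's own value of the fictitious future.
    """
    targets = []
    for _ in range(n_samples):
        state = belief_pi0(obs_history)          # resample the past under pi_0
        reward, next_hist, done = env_step(state, action)
        bootstrap = 0.0 if done else gamma * q_next(next_hist)
        targets.append(reward + bootstrap)
    return float(np.mean(targets))               # Monte-Carlo target estimate

# Dummy stand-ins, only to show the call pattern.
belief = lambda hist: int(rng.integers(0, 5))    # a fictitious hidden state
step = lambda s, a: (float(s == a), None, True)  # toy one-step reward
q_next = lambda hist: 0.0                        # unused: episode ends
print(obl_target(obs_history=None, action=3, belief_pi0=belief,
                 env_step=step, q_next=q_next))
```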
Related papers
- Inference-Time Policy Steering through Human Interactions [54.02655062969934]
During inference, humans are often removed from the policy execution loop.
We propose an Inference-Time Policy Steering framework that leverages human interactions to bias the generative sampling process.
Our proposed sampling strategy achieves the best trade-off between alignment and distribution shift.
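
The summary leaves the steering mechanism abstract; one hedged, hypothetical reading is that candidates sampled from a generative policy are re-weighted by a human-provided score before one is executed. The resampling scheme and names below are illustrative assumptions, not the paper's actual sampling strategies.

```python
# Hypothetical sketch of inference-time steering by preference-weighted resampling.
import numpy as np

rng = np.random.default_rng(1)

def steer(sample_policy, human_score, n_candidates=8, temperature=1.0):
    candidates = [sample_policy() for _ in range(n_candidates)]
    scores = np.array([human_score(c) for c in candidates])
    probs = np.exp(scores / temperature)
    probs /= probs.sum()                      # bias sampling toward human preference
    return candidates[rng.choice(n_candidates, p=probs)]

# Toy usage: the policy proposes 2-D waypoints, the human prefers points near (1, 1).
proposal = lambda: rng.normal(size=2)
preference = lambda x: -np.linalg.norm(x - np.array([1.0, 1.0]))
print(steer(proposal, preference))
```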
arXiv Detail & Related papers (2024-11-25T18:03:50Z) - Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [44.95386817008473]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data.
We show that this approach generalizes the direct alignment method IPO (identity preference optimization) and the classic policy gradient.
We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
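
A hedged sketch of a pairwise ("contrastive") policy-gradient step on a toy bandit, where each sample's partner in the pair serves as its reward baseline; this is a generic reading of the summary, not necessarily the exact CoPG objective.

```python
# Hedged sketch: contrastive/pairwise REINFORCE on a 3-armed bandit.
import numpy as np

rng = np.random.default_rng(2)
true_rewards = np.array([0.1, 0.5, 0.9])
logits = np.zeros(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.5
for _ in range(2000):
    p = softmax(logits)
    a1, a2 = rng.choice(3, size=2, p=p)      # a contrastive pair of samples
    r1, r2 = true_rewards[a1], true_rewards[a2]
    grad = np.zeros(3)
    # REINFORCE with the partner's reward as baseline, symmetrized over the pair.
    for a, adv in ((a1, r1 - r2), (a2, r2 - r1)):
        g = -p.copy(); g[a] += 1.0           # d log pi(a) / d logits
        grad += 0.5 * adv * g
    logits += lr * grad

print(softmax(logits))                        # mass should concentrate on arm 2
```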
arXiv Detail & Related papers (2024-06-27T14:03:49Z) - Oracle-Efficient Reinforcement Learning for Max Value Ensembles [7.404901768256101]
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, theoretically and experimentally.
In this work we aim to compete with the $\textit{max-following policy}$, which at each state follows the action of whichever constituent policy has the highest value.
Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies.
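
The max-following policy itself is simple to state; the sketch below assumes direct access to the constituent policies' value functions, whereas the paper's oracle-efficient algorithm only assumes access to the constituent policies themselves.

```python
# Minimal sketch of the max-following policy: at each state, act with the
# constituent policy whose value estimate at that state is highest.
import numpy as np

def max_following_action(state, policies, value_fns):
    values = [v(state) for v in value_fns]           # V^{pi_k}(state)
    best = int(np.argmax(values))                    # pick the best constituent
    return policies[best](state)                     # follow its action

# Toy usage with two constituent policies on a scalar state.
policies = [lambda s: 0, lambda s: 1]
value_fns = [lambda s: -abs(s - 2.0), lambda s: -abs(s + 2.0)]
print(max_following_action(1.5, policies, value_fns))   # -> action 0
```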
arXiv Detail & Related papers (2024-05-27T01:08:23Z) - AgentMixer: Multi-Agent Correlated Policy Factorization [39.041191852287525]
We introduce \textit{strategy modification} to provide a mechanism for agents to correlate their policies.
We present a novel framework, AgentMixer, which constructs the joint fully observable policy as a non-linear combination of individual partially observable policies.
We show that AgentMixer converges to an $\epsilon$-approximate Correlated Equilibrium.
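
One hedged way to picture "a non-linear combination of individual partially observable policies" is a state-conditioned mixer modulating per-agent logits; the tiny model below is an illustrative assumption, not the AgentMixer architecture or its strategy-modification training.

```python
# Hedged sketch: combining local policy logits via a full-state-conditioned mixer.
import numpy as np

rng = np.random.default_rng(3)
n_agents, n_actions, obs_dim, state_dim = 2, 3, 4, 8

# Per-agent linear policy heads over local observations (stand-ins for networks).
W_local = [rng.normal(size=(n_actions, obs_dim)) for _ in range(n_agents)]
# Mixer weights conditioned on the full state (a one-layer non-linearity).
W_mix = rng.normal(size=(n_agents * n_actions, state_dim))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_policy(state, local_obs):
    local_logits = np.concatenate([W @ o for W, o in zip(W_local, local_obs)])
    gate = np.tanh(W_mix @ state)                    # state-dependent modulation
    mixed = local_logits * (1.0 + gate)              # non-linear combination
    per_agent = mixed.reshape(n_agents, n_actions)
    return [softmax(row) for row in per_agent]

state = rng.normal(size=state_dim)
obs = [rng.normal(size=obs_dim) for _ in range(n_agents)]
print(joint_policy(state, obs))
```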
arXiv Detail & Related papers (2024-01-16T15:32:41Z) - Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper level) and the value function (lower level).
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
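
A minimal sketch of the bi-level idea on tabular data: the lower level evaluates the current policy from the fixed dataset (here with a simple count-based pessimism penalty), and the upper level improves the policy against that estimate. The estimators are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch of a bi-level offline RL loop on a toy tabular dataset.
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma = 3, 2, 0.9
# A fixed offline dataset of (s, a, r, s') tuples from some behavior policy.
data = [(int(rng.integers(nS)), int(rng.integers(nA)), float(rng.random()),
         int(rng.integers(nS))) for _ in range(300)]

policy = np.full((nS, nA), 1.0 / nA)
for _ in range(10):
    # Lower level: fitted evaluation of the current policy from the dataset.
    Q = np.zeros((nS, nA))
    for _ in range(30):
        tgt, cnt = np.zeros((nS, nA)), np.zeros((nS, nA))
        for s, a, r, s2 in data:
            tgt[s, a] += r + gamma * policy[s2] @ Q[s2]
            cnt[s, a] += 1
        Q = np.divide(tgt, cnt, out=np.zeros_like(tgt), where=cnt > 0)
    Q_pess = Q - 1.0 / np.sqrt(np.maximum(cnt, 1))   # count-based pessimism
    # Upper level: greedy policy improvement against the pessimistic values.
    policy = np.eye(nA)[Q_pess.argmax(axis=1)]

print(policy)
```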
arXiv Detail & Related papers (2023-10-10T02:45:50Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
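
A hedged sketch of selecting a policy by its lower confidence bound: score candidates with an inverse-propensity-weighted value estimate minus an empirical-Bernstein-style penalty, and keep the one with the largest LCB. The synthetic data and penalty constants are illustrative, not PPL's exact estimator.

```python
# Hedged sketch: LCB-based offline policy selection on synthetic logged data.
import numpy as np

rng = np.random.default_rng(5)
n, n_actions = 2000, 3
x = rng.normal(size=n)                           # individual characteristics
behavior_p = np.full(n_actions, 1.0 / n_actions) # known uniform logging policy
a = rng.choice(n_actions, size=n, p=behavior_p)  # logged actions
r = (a == (x > 0).astype(int)).astype(float)     # logged rewards

def lcb_value(policy_fn, delta=0.05):
    """IPW value estimate minus an empirical-Bernstein-style penalty."""
    w = (policy_fn(x) == a) / behavior_p[a]      # importance weights
    est = np.mean(w * r)
    var = np.var(w * r)
    penalty = (np.sqrt(2 * var * np.log(1 / delta) / n)
               + 7 * np.log(1 / delta) / (3 * (n - 1)))
    return est - penalty

candidates = {
    "always_0":  lambda x: np.zeros_like(x, dtype=int),
    "threshold": lambda x: (x > 0).astype(int),
}
print({name: round(lcb_value(f), 3) for name, f in candidates.items()})
```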
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - A State-Distribution Matching Approach to Non-Episodic Reinforcement
Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
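
One hedged way to read "match the state distribution in the provided demonstrations" is a classifier-based reward: train a discriminator between demonstration states and the backward policy's states, and reward the backward policy for states the discriminator attributes to the demonstrations. The sketch below only shows that reward on made-up data, not the paper's full forward/backward training.

```python
# Hedged sketch of a discriminator-based state-matching reward for a backward policy.
import numpy as np

rng = np.random.default_rng(6)
demo_states  = rng.normal(loc=+1.0, size=(200, 2))   # states from demonstrations
agent_states = rng.normal(loc=-1.0, size=(200, 2))   # states from the backward policy

# Logistic-regression discriminator D(s) ~= P(s came from the demonstrations).
X = np.vstack([demo_states, agent_states])
y = np.concatenate([np.ones(200), np.zeros(200)])
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * np.mean(p - y)

def backward_reward(state):
    d = 1.0 / (1.0 + np.exp(-(state @ w + b)))
    return float(np.log(d + 1e-8) - np.log(1 - d + 1e-8))   # GAIL-style reward

print(backward_reward(np.array([1.0, 1.0])),     # high: looks like a demo state
      backward_reward(np.array([-1.0, -1.0])))   # low: looks like an agent state
```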
arXiv Detail & Related papers (2022-05-11T00:06:29Z) - Constructing a Good Behavior Basis for Transfer using Generalized Policy
Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
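
A minimal sketch of generalized policy improvement over such a set: on a new task, act greedily with respect to the maximum over the constituent policies' action values. The Q-functions are given directly here as an assumption; in the paper they would come from the independent policies via generalized policy updates.

```python
# Minimal sketch of generalized policy improvement (GPI) over a set of policies.
import numpy as np

def gpi_action(state, q_fns, n_actions):
    # q_fns[i](state, a) ~= Q^{pi_i}(state, a) under the new task's reward.
    q_max = [max(q(state, a) for q in q_fns) for a in range(n_actions)]
    return int(np.argmax(q_max))

# Toy usage: two source policies, each good at one of two actions.
q_fns = [lambda s, a: 1.0 if a == 0 else 0.0,
         lambda s, a: 0.8 if a == 1 else 0.0]
print(gpi_action(state=None, q_fns=q_fns, n_actions=2))   # -> 0
```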
arXiv Detail & Related papers (2021-12-30T12:20:46Z) - Independent Policy Gradient Methods for Competitive Reinforcement
Learning [62.91197073795261]
We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents.
We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule.
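
A hedged sketch of the two-timescale rule on the simplest competitive example, matching pennies: both players run independent REINFORCE updates on their own rewards, with one learning rate much smaller than the other. The Markov-game setting and the paper's guarantees are not reproduced here.

```python
# Hedged sketch: independent policy gradient with two-timescale learning rates.
import numpy as np

rng = np.random.default_rng(7)
A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # payoff to player 1 (matching pennies)
theta1, theta2 = np.zeros(2), np.zeros(2)
eta1, eta2 = 0.002, 0.05                   # two-timescale learning rates

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(50000):
    p1, p2 = softmax(theta1), softmax(theta2)
    a1, a2 = rng.choice(2, p=p1), rng.choice(2, p=p2)
    r1 = A[a1, a2]                         # player 2 receives -r1
    g1 = -p1.copy(); g1[a1] += 1.0         # d log pi1(a1) / d theta1
    g2 = -p2.copy(); g2[a2] += 1.0
    theta1 += eta1 * r1 * g1               # each player sees only its own reward
    theta2 += eta2 * (-r1) * g2

# The min-max equilibrium of matching pennies is (0.5, 0.5) for both players.
print(softmax(theta1), softmax(theta2))
```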
arXiv Detail & Related papers (2021-01-11T23:20:42Z) - BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation so that they jointly maximize a lower bound on policy performance.
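
A hedged sketch of the residual-policy form: the executed policy is a state-dependent mixture of the behavior policy and a learned correction, where the allowable deviation eps(s) is itself a (here hand-coded) function of state. BRPO's parameterization and its lower-bound objective are not reproduced.

```python
# Hedged sketch of a residual policy with a state-dependent allowable deviation.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def residual_policy(state, behavior_probs, correction_logits, eps_fn):
    eps = eps_fn(state)                              # allowable deviation in [0, 1]
    corrected = softmax(correction_logits(state))    # learned residual component
    return (1.0 - eps) * behavior_probs(state) + eps * corrected

# Toy usage: deviate more in states where the behavior policy is well covered.
behavior = lambda s: np.array([0.7, 0.3])
correction = lambda s: np.array([-1.0, 1.0])
eps_fn = lambda s: 0.5 if s == "well_covered" else 0.05
print(residual_policy("well_covered", behavior, correction, eps_fn))
print(residual_policy("rare", behavior, correction, eps_fn))
```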
arXiv Detail & Related papers (2020-02-08T01:59:33Z)