Multi-Agent Imitation Learning: Value is Easy, Regret is Hard
- URL: http://arxiv.org/abs/2406.04219v2
- Date: Wed, 26 Jun 2024 03:39:31 GMT
- Title: Multi-Agent Imitation Learning: Value is Easy, Regret is Hard
- Authors: Jingwu Tang, Gokul Swamy, Fei Fang, Zhiwei Steven Wu
- Abstract summary: We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents.
Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations.
While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents.
- Score: 52.31989962031179
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents based on demonstrations of an expert doing so. Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations. While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents. Intuitively, this is because strategic deviations can depend on a counterfactual quantity: the coordinator's recommendations outside of the state distribution their recommendations induce. In response, we initiate the study of an alternative objective for MAIL in Markov Games we term the regret gap that explicitly accounts for potential deviations by agents in the group. We first perform an in-depth exploration of the relationship between the value and regret gaps. First, we show that while the value gap can be efficiently minimized via a direct extension of single-agent IL algorithms, even value equivalence can lead to an arbitrarily large regret gap. This implies that achieving regret equivalence is harder than achieving value equivalence in MAIL. We then provide a pair of efficient reductions to no-regret online convex optimization that are capable of minimizing the regret gap (a) under a coverage assumption on the expert (MALICE) or (b) with access to a queryable expert (BLADES).
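As a rough illustration of the two objectives (notation here is ours and may not match the paper's exact definitions), the value gap compares returns when all agents follow the coordinator's recommendations, while the regret gap compares how much any single agent could gain by deviating under each policy:

```latex
% Hedged sketch: \sigma_E is the expert coordinator, \hat{\sigma} the learner,
% J(\sigma) the expected return under \sigma, J_i agent i's value, and
% \Phi_i a class of per-agent deviations (\phi_i \diamond \sigma denotes
% agent i deviating via \phi_i while the others follow \sigma).
\mathrm{ValueGap}(\hat{\sigma}) = J(\sigma_E) - J(\hat{\sigma}),
\qquad
\mathrm{Reg}(\sigma) = \max_{i} \max_{\phi_i \in \Phi_i}
  \big[ J_i(\phi_i \diamond \sigma) - J_i(\sigma) \big],
\qquad
\mathrm{RegretGap}(\hat{\sigma}) = \mathrm{Reg}(\hat{\sigma}) - \mathrm{Reg}(\sigma_E).
```

Under this reading, matching the expert's behavior on the demonstrated distribution can close the value gap, but the regret gap also depends on the coordinator's recommendations outside the induced state distribution, which demonstrations alone may not pin down.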
Related papers
- Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards [38.056359612828466]
We propose a novel value-based deep RL algorithm called Iterative learning from Corrective actions and Proxy rewards (ICoPro).
We experimentally validate our proposal on a variety of tasks (Atari games and autonomous highway driving).
arXiv Detail & Related papers (2024-10-08T08:04:09Z)
- Best Arm Identification with Minimal Regret [55.831935724659175]
The problem studied elegantly amalgamates regret minimization and best arm identification (BAI).
The agent's goal is to identify the best arm with a prescribed confidence level.
The Double KL-UCB algorithm achieves optimality as the confidence level tends to zero.
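For context, here is a minimal sketch of the classic KL-UCB index for Bernoulli rewards; this is a generic illustration, not the paper's Double KL-UCB variant or its exact threshold.

```python
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence KL(Ber(p) || Ber(q)), clamped to avoid log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean: float, pulls: int, t: int, c: float = 0.0) -> float:
    """Largest q >= mean with pulls * KL(mean, q) <= log(t) + c*log(log(t)),
    found by bisection. Classic KL-UCB index, shown for illustration only."""
    if pulls == 0:
        return 1.0
    budget = math.log(max(t, 2)) + c * math.log(math.log(max(t, 3)))
    lo, hi = mean, 1.0
    for _ in range(50):  # bisection on the index
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# Example: an arm pulled 20 times with empirical mean 0.6 at round t = 100.
print(kl_ucb_index(0.6, 20, 100))  # optimistic index in (0.6, 1.0)
```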
arXiv Detail & Related papers (2024-09-27T16:46:02Z)
- Provable Interactive Learning with Hindsight Instruction Feedback [29.754170272323105]
We study learning with hindsight instruction where a teacher provides an instruction that is most suitable for the agent's generated response.
This hindsight labeling of instructions is often easier to provide than expert supervision of the optimal response.
We introduce an algorithm called LORIL for this setting and show that its regret scales as $\sqrt{T}$, where $T$ is the number of rounds, and depends on the intrinsic rank of the problem.
arXiv Detail & Related papers (2024-04-14T02:18:07Z)
- A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories [122.11358440078581]
Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable.
We propose Trajectory-Aware Learning from Observations (TAILO) to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available.
arXiv Detail & Related papers (2023-11-02T15:41:09Z) - Learning to Incentivize Information Acquisition: Proper Scoring Rules
Meet Principal-Agent Model [64.94131130042275]
We study the incentivized information acquisition problem, where a principal hires an agent to gather information on her behalf.
We design a provably sample efficient algorithm that tailors the UCB algorithm to our model.
Our algorithm features a delicate estimation procedure for the optimal profit of the principal, and a conservative correction scheme that ensures the desired actions of the agent are incentivized.
arXiv Detail & Related papers (2023-03-15T13:40:16Z)
- Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction [12.94372063457462]
Centralised training with decentralised execution (CT-DE) serves as the foundation of many leading multi-agent reinforcement learning (MARL) algorithms.
It suffers from a critical drawback due to its reliance on learning from a single sample of the joint-action at a given state.
We propose an enhancement tool that accommodates any actor-critic MARL method.
arXiv Detail & Related papers (2022-09-02T13:44:00Z)
- LobsDICE: Offline Imitation Learning from Observation via Stationary Distribution Correction Estimation [37.31080581310114]
We present LobsDICE, an offline imitation-from-observation (IfO) algorithm that learns to imitate the expert policy via optimization in the space of stationary distributions.
Our algorithm solves a single convex minimization problem, which minimizes the divergence between the two state-transition distributions induced by the expert and the agent policy.
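Read directly from the summary above, the quantity being minimized can be sketched as follows (the notation and the specific choice of divergence are illustrative, not taken from the paper):

```latex
% Hedged sketch: d^{\pi}(s, s') and d^{E}(s, s') are the stationary
% state-transition distributions induced by the agent and the expert;
% the paper is described as solving a single convex problem whose
% solution minimizes a divergence D between the two.
\min_{\pi} \; D\!\big( d^{\pi}(s, s'), \; d^{E}(s, s') \big)
```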
arXiv Detail & Related papers (2022-02-28T04:24:30Z)
- Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
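As a loose illustration of the idea described above (a generic penalty term, not the paper's exact operator, baseline, or coefficient), one can augment the usual TD loss with a regularizer that discourages joint action-values from drifting far from a baseline estimate:

```python
import numpy as np

def regularized_td_loss(q_joint: np.ndarray,
                        td_target: np.ndarray,
                        baseline: np.ndarray,
                        lam: float = 0.1) -> float:
    """Illustrative loss: squared TD error plus a penalty on joint
    action-values that deviate from a baseline estimate. The baseline
    and penalty used in the paper may differ."""
    td_loss = np.mean((q_joint - td_target) ** 2)
    penalty = np.mean((q_joint - baseline) ** 2)
    return float(td_loss + lam * penalty)

# Toy example for a batch of three joint actions.
q = np.array([1.2, 0.8, 2.5])
target = np.array([1.0, 0.9, 1.1])
base = np.array([1.0, 0.9, 1.2])
print(regularized_td_loss(q, target, base))
```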
arXiv Detail & Related papers (2021-03-22T14:18:39Z)
- Tilted Empirical Risk Minimization [26.87656095874882]
We show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to empirical risk minimization.
We show that tilted empirical risk minimization (TERM) can increase or decrease the influence of outliers to enable fairness or robustness, respectively.
It can also enable entirely new applications, such as simultaneously addressing outliers and promoting fairness.
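For reference, a minimal sketch of the tilted empirical risk, assuming the standard TERM objective $(1/t)\log\frac{1}{N}\sum_i e^{t\,\ell_i}$: a positive tilt $t$ amplifies high-loss (outlier) points, a negative tilt suppresses them, and $t \to 0$ recovers the ordinary average.

```python
import numpy as np

def tilted_loss(losses: np.ndarray, t: float) -> float:
    """Tilted empirical risk: (1/t) * log( mean( exp(t * loss_i) ) ).
    Uses a log-sum-exp shift for numerical stability; t -> 0 gives the mean."""
    if abs(t) < 1e-8:
        return float(np.mean(losses))
    z = t * losses
    m = np.max(z)
    return float((m + np.log(np.mean(np.exp(z - m)))) / t)

losses = np.array([0.1, 0.2, 0.15, 3.0])  # one outlier
print(tilted_loss(losses, t=0.0))   # plain average
print(tilted_loss(losses, t=2.0))   # outlier's influence increased
print(tilted_loss(losses, t=-2.0))  # outlier's influence suppressed
```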
arXiv Detail & Related papers (2020-07-02T14:49:48Z)
- An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction [84.49035467829819]
We show that it is possible to better manage the trade-off between rationale conciseness and end-task performance by optimizing a bound on the Information Bottleneck (IB) objective.
Our fully unsupervised approach jointly learns an explainer that predicts sparse binary masks over sentences, and an end-task predictor that considers only the extracted rationale.
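For orientation, the generic Information Bottleneck objective whose bound is optimized can be written in its standard form (illustrative; here $X$ is the input, $Z$ the extracted rationale, $Y$ the end-task label, and $\beta$ controls conciseness):

```latex
% Standard IB objective (sketch): keep Z predictive of Y while compressing X.
\max_{p(z \mid x)} \; I(Z; Y) \;-\; \beta \, I(Z; X)
```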
arXiv Detail & Related papers (2020-05-01T23:26:41Z)