Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning
- URL: http://arxiv.org/abs/2602.21020v1
- Date: Tue, 24 Feb 2026 15:38:11 GMT
- Title: Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning
- Authors: Antoine Bergerault, Volkan Cevher, Negar Mehr
- Abstract summary: Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned policies are from a Nash equilibrium are missing for offline MA-IL.
- Score: 51.77462571479799
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned policies are from a Nash equilibrium are missing for offline MA-IL. In this paper, we demonstrate impossibility and hardness results of learning low-exploitable policies in general $n$-player Markov Games. We do so by providing examples where even exact measure matching fails, and demonstrating a new hardness result on characterizing the Nash gap given a fixed measure matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for the case of dominant strategy expert equilibria, assuming Behavioral Cloning error $\varepsilon_{\text{BC}}$, this provides a Nash imitation gap of $\mathcal{O}\left(n\varepsilon_{\text{BC}}/(1-\gamma)^2\right)$ for a discount factor $\gamma$. We generalize this result with a new notion of best-response continuity, and argue that this is implicitly encouraged by standard regularization techniques.
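To make the stated rate concrete, here is a minimal sketch (with illustrative, made-up values for the number of players, the BC error, and the discount factor) that evaluates the $\mathcal{O}\left(n\varepsilon_{\text{BC}}/(1-\gamma)^2\right)$ upper bound and shows how it degrades as $\gamma \to 1$:

```python
# Minimal sketch: evaluate the O(n * eps_BC / (1 - gamma)^2) Nash imitation
# gap bound from the abstract for illustrative (made-up) problem parameters.

def nash_imitation_gap_bound(n_players: int, eps_bc: float, gamma: float) -> float:
    """Upper bound on the Nash gap under a dominant-strategy expert equilibrium,
    up to the constant hidden by the O(.) notation (taken as 1 here)."""
    return n_players * eps_bc / (1.0 - gamma) ** 2

if __name__ == "__main__":
    # Hypothetical values, chosen only to show how the bound scales with gamma.
    for gamma in (0.9, 0.99):
        bound = nash_imitation_gap_bound(n_players=3, eps_bc=0.01, gamma=gamma)
        print(f"gamma={gamma}: Nash gap <= O({bound:.2f})")
```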
Related papers
- Statistical analysis of Inverse Entropy-regularized Reinforcement Learning [15.054399128586232]
Inverse reinforcement learning aims to infer the reward function that explains expert behavior observed through trajectories of state--action pairs. Many reward functions can induce the same optimal policy, rendering the inverse problem ill-posed. We develop a statistical framework for Inverse Entropy-regularized Reinforcement Learning.
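As a quick illustration of the ill-posedness mentioned above, the sketch below (a hypothetical tabular example, not the paper's framework) runs soft value iteration on a small random MDP and checks that shifting every reward by a constant leaves the entropy-regularized optimal policy unchanged:

```python
# Illustration of ill-posedness: a constant reward shift yields the same
# entropy-regularized (softmax) optimal policy. Tabular example, made-up MDP.
import numpy as np

def soft_value_iteration(R, P, gamma=0.9, beta=1.0, iters=500):
    """R: (S, A) rewards, P: (S, A, S) transition probabilities."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V                              # (S, A) soft Q backup
        V = beta * np.log(np.exp(Q / beta).sum(axis=1))    # log-sum-exp state value
    pi = np.exp(Q / beta)
    return pi / pi.sum(axis=1, keepdims=True)              # softmax policy

rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))                 # random transition kernel
R = rng.normal(size=(S, A))

pi1 = soft_value_iteration(R, P)
pi2 = soft_value_iteration(R + 5.0, P)                     # constant shift of every reward
print(np.allclose(pi1, pi2, atol=1e-6))                    # True: same optimal policy
```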
arXiv Detail & Related papers (2025-12-07T18:26:19Z) - Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^\pi$-Realizable MDPs [16.69532546126409]
We study the problem of offline imitation learning in Markov decision processes (MDPs). We introduce a new algorithm called saddle-point offline imitation learning (SPOIL). SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.
arXiv Detail & Related papers (2025-05-26T13:10:27Z) - Accelerating Nash Learning from Human Feedback via Mirror Prox [36.04055906691423]
We introduce Nash Mirror Prox ($\mathtt{Nash-MP}$), an online NLHF algorithm that leverages the Mirror Prox optimization scheme to achieve fast and stable convergence to the Nash equilibrium. Our theoretical analysis establishes that Nash-MP exhibits last-iterate linear convergence towards the $\beta$-regularized Nash equilibrium. We show that Nash-MP exhibits last-iterate linear convergence for the exploitability gap and uniformly for the span semi-norm of log-probabilities.
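The snippet below is a generic Mirror Prox sketch (extragradient with an entropic mirror map) on a small zero-sum matrix game, tracking the exploitability of the final iterate. It illustrates the optimization scheme named above but is not the Nash-MP algorithm for NLHF itself; the payoff matrix and step size are made up:

```python
# Generic Mirror Prox (multiplicative weights with an extra gradient step)
# on a zero-sum matrix game; not the paper's Nash-MP algorithm itself.
import numpy as np

def exploitability(A, x, y):
    """Nash gap of the strategy pair (x, y): best-response gap for both players."""
    return (A @ y).max() - (A.T @ x).min()

def mw_step(p, grad, eta):
    """Entropic (multiplicative-weights) mirror step on the simplex."""
    q = p * np.exp(eta * grad)
    return q / q.sum()

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))      # made-up payoff matrix; row player maximizes x^T A y
x = np.ones(5) / 5
y = np.ones(5) / 5
eta = 0.1

for _ in range(2000):
    # Extrapolation step at the current point ...
    x_half = mw_step(x, +A @ y, eta)
    y_half = mw_step(y, -A.T @ x, eta)
    # ... then the actual update uses gradients at the extrapolated point.
    x = mw_step(x, +A @ y_half, eta)
    y = mw_step(y, -A.T @ x_half, eta)

print("exploitability of last iterate:", exploitability(A, x, y))
```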
arXiv Detail & Related papers (2025-05-26T09:17:32Z) - Multi-Agent Inverse Q-Learning from Demonstrations [3.4136908117644698]
Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL) is a novel sample-efficient framework for multi-agent IRL. We show MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery, often by more than 2-5x.
arXiv Detail & Related papers (2025-03-06T18:22:29Z) - Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits [49.96531901205305]
We analyze $f$-divergence-regularized offline policy learning. For reverse Kullback-Leibler (KL) divergence, we give the first $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability. We extend our analysis to dueling bandits, and we believe these results take a significant step toward a comprehensive understanding of $f$-divergence-regularized policy learning.
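For intuition about the regularized objective studied there, the following sketch (made-up rewards, reference policy, and regularization strength) computes the closed-form optimizer of a reverse-KL-regularized bandit objective for a single context, $\pi^*(a) \propto \mu(a)\exp(r(a)/\beta)$:

```python
# Closed-form optimizer of the reverse-KL-regularized objective
#   max_pi  E_{a~pi}[r(a)] - beta * KL(pi || mu)
# for a single context (hypothetical rewards, reference policy, and beta).
import numpy as np

def kl_regularized_policy(r, mu, beta):
    w = mu * np.exp(r / beta)
    return w / w.sum()

r = np.array([1.0, 0.2, -0.5])        # hypothetical per-action rewards
mu = np.array([0.5, 0.3, 0.2])        # reference (behavior) policy
for beta in (10.0, 1.0, 0.1):
    pi = kl_regularized_policy(r, mu, beta)
    print(beta, np.round(pi, 3))      # small beta -> near-greedy; large beta -> close to mu
```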
arXiv Detail & Related papers (2025-02-09T22:14:45Z) - A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning [53.83345471268163]
We investigate learning the equilibria in non-stationary multi-agent systems.
We show how to test for various types of equilibria by a black-box reduction to single-agent learning.
arXiv Detail & Related papers (2023-06-12T23:48:24Z) - LS-IQ: Implicit Reward Regularization for Inverse Reinforcement Learning [30.4251858001151]
Prior works show that a squared norm regularization on the implicit reward function is effective, but do not provide a theoretical analysis of the resulting properties of the algorithms.
We show that our method, Least Squares Inverse Q-Learning, outperforms state-of-the-art algorithms, particularly in environments with absorbing states.
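As a rough illustration of the regularizer being analyzed, the sketch below (a hypothetical tabular setup, not the LS-IQ algorithm) computes an IQ-Learn-style implicit reward $r(s, a, s') = Q(s, a) - \gamma V(s')$ on sampled transitions and penalizes its squared norm:

```python
# Rough illustration (not LS-IQ itself): squared-norm penalty on the
# implicit reward r(s, a, s') = Q(s, a) - gamma * V(s') of a tabular Q.
import numpy as np

def implicit_reward_sq_penalty(Q, transitions, gamma=0.99, beta=1.0):
    """transitions: list of (s, a, s_next) indices sampled from some dataset."""
    V = beta * np.log(np.exp(Q / beta).sum(axis=1))          # soft state value
    r = np.array([Q[s, a] - gamma * V[s2] for (s, a, s2) in transitions])
    return np.mean(r ** 2)                                    # squared-norm regularizer

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))                                   # made-up tabular Q-function
batch = [(0, 1, 2), (1, 0, 3), (2, 2, 0)]                     # made-up transitions
print(implicit_reward_sq_penalty(Q, batch))
```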
arXiv Detail & Related papers (2023-03-01T15:46:12Z) - Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy using $\widetilde{O}\big(|S|(|A|+|B|)/((1-\gamma)^{3}\varepsilon^{2})\big)$ samples.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\widetilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)