Policy Evaluation and Seeking for Multi-Agent Reinforcement Learning via
Best Response
- URL: http://arxiv.org/abs/2006.09585v2
- Date: Sat, 20 Jun 2020 04:22:47 GMT
- Title: Policy Evaluation and Seeking for Multi-Agent Reinforcement Learning via
Best Response
- Authors: Rui Yan and Xiaoming Duan and Zongying Shi and Yisheng Zhong and Jason
R. Marden and Francesco Bullo
- Abstract summary: We adopt strict best response dynamics to model selfish behaviors at a meta-level for multi-agent reinforcement learning.
Our approach is more compatible with single-agent reinforcement learning than alpha-rank, which relies on weakly better responses.
- Score: 15.149039407681945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces two metrics (cycle-based and memory-based), grounded
in a dynamical game-theoretic solution concept called sink
equilibrium, for the evaluation, ranking, and computation of policies in
multi-agent learning. We adopt strict best response dynamics (SBRD) to model
selfish behaviors at a meta-level for multi-agent reinforcement learning. Our
approach can deal with dynamical cyclical behaviors (unlike approaches based on
Nash equilibria and Elo ratings), and is more compatible with single-agent
reinforcement learning than alpha-rank, which relies on weakly better responses.
We first consider settings where the gap between the largest and second-largest
underlying metric values has a known lower bound. With this knowledge we
propose a class of perturbed SBRD with the following property: only policies
with maximum metric are observed with nonzero probability for a broad class of
stochastic games with finite memory. We then consider settings where the lower
bound for the difference is unknown. For this setting, we propose a class of
perturbed SBRD such that the metrics of the policies observed with nonzero
probability differ from the optimal value by no more than a given tolerance. The proposed
perturbed SBRD addresses the opponent-induced non-stationarity by fixing the
strategies of others for the learning agent, and uses empirical game-theoretic
analysis to estimate payoffs for each strategy profile obtained due to the
perturbation.
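
As a rough illustration of the meta-level dynamics described in the abstract (and not the authors' implementation), the sketch below builds the strict-best-response graph of a small, randomly generated two-player meta-game, finds its sink strongly connected components (the sink equilibria), and runs a simple perturbed strict-best-response walk over strategy profiles. The payoff tables, the perturbation rate eps, and all function names are illustrative assumptions.

```python
# Hypothetical sketch only: sink equilibria of a tiny two-player meta-game
# via its strict-best-response graph, plus a perturbed SBRD random walk.
# Payoffs, eps, and all names are assumptions made for illustration.
import itertools
import random

import numpy as np

rng = np.random.default_rng(0)
n_policies = 3
# payoffs[i][(a0, a1)] = agent i's payoff when agents play policies a0, a1.
payoffs = [rng.standard_normal((n_policies, n_policies)) for _ in range(2)]
profiles = list(itertools.product(range(n_policies), repeat=2))


def strict_best_response_edges(profile):
    """Profiles reachable by one agent's strict unilateral improvement."""
    edges = []
    for agent in range(2):
        current = payoffs[agent][profile]
        for dev in range(n_policies):
            new = list(profile)
            new[agent] = dev
            new = tuple(new)
            if payoffs[agent][new] > current:  # strictly better response
                edges.append(new)
    return edges


def sink_equilibria():
    """Sink strongly connected components of the strict-best-response graph."""
    # For a tiny game, compute full reachability by depth-first search.
    reach = {}
    for p in profiles:
        stack, seen = [p], {p}
        while stack:
            q = stack.pop()
            for r in strict_best_response_edges(q):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        reach[p] = seen
    # p lies in a sink SCC iff everything reachable from p can reach p back.
    return {p for p in profiles if all(p in reach[q] for q in reach[p])}


def perturbed_sbrd(steps=10_000, eps=0.05):
    """Perturbed SBRD walk: with probability eps, explore a random deviation."""
    visits = {p: 0 for p in profiles}
    profile = random.choice(profiles)
    for _ in range(steps):
        visits[profile] += 1
        if random.random() < eps:                 # perturbation step
            agent = random.randrange(2)
            new = list(profile)
            new[agent] = random.randrange(n_policies)
            profile = tuple(new)
        else:                                     # strict best response step
            better = strict_best_response_edges(profile)
            if better:
                profile = random.choice(better)
    return visits


if __name__ == "__main__":
    print("sink-equilibrium profiles:", sorted(sink_equilibria()))
    counts = perturbed_sbrd()
    print("most-visited profile under perturbed SBRD:",
          max(counts, key=counts.get))
```

In the paper's setting the payoff entries would be estimated by empirical game-theoretic analysis from single-agent RL runs against fixed opponents; here they are random numbers purely to keep the sketch self-contained and executable.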
Related papers
- Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations.
We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games.
Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
arXiv Detail & Related papers (2024-09-01T13:14:41Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergence (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Zero-Sum Positional Differential Games as a Framework for Robust Reinforcement Learning: Deep Q-Learning Approach [2.3020018305241337]
This paper is the first to propose considering robust reinforcement learning (RRL) problems within positional differential game theory.
Namely, we prove that under Isaacs's condition, the same Q-function can be utilized as an approximate solution of both minimax and maximin Bellman equations.
We present the Isaacs Deep Q-Network algorithms and demonstrate their superiority compared to other baseline RRL and Multi-Agent RL algorithms in various environments.
arXiv Detail & Related papers (2024-05-03T12:21:43Z) - Learning and Calibrating Heterogeneous Bounded Rational Market Behaviour
with Multi-Agent Reinforcement Learning [4.40301653518681]
Agent-based models (ABMs) have shown promise for modelling various real-world phenomena incompatible with traditional equilibrium analysis.
Recent developments in multi-agent reinforcement learning (MARL) offer a way to address this issue from a rationality perspective.
We propose a novel technique for representing heterogeneous processing-constrained agents within a MARL framework.
arXiv Detail & Related papers (2024-02-01T17:21:45Z) - A Minimaximalist Approach to Reinforcement Learning from Human Feedback [49.45285664482369]
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback.
Our approach is minimalist in that it requires neither training a reward model nor unstable adversarial training.
We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches.
arXiv Detail & Related papers (2024-01-08T17:55:02Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Efficient Model-based Multi-agent Reinforcement Learning via Optimistic
Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment.
We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.