Policy Evaluation and Seeking for Multi-Agent Reinforcement Learning via
Best Response
- URL: http://arxiv.org/abs/2006.09585v2
- Date: Sat, 20 Jun 2020 04:22:47 GMT
- Title: Policy Evaluation and Seeking for Multi-Agent Reinforcement Learning via
Best Response
- Authors: Rui Yan and Xiaoming Duan and Zongying Shi and Yisheng Zhong and Jason
R. Marden and Francesco Bullo
- Abstract summary: We adopt strict best response dynamics to model selfish behaviors at a meta-level for multi-agent reinforcement learning.
Our approach is more compatible with single-agent reinforcement learning than alpha-rank which relies on weakly better responses.
- Score: 15.149039407681945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces two metrics (cycle-based and memory-based metrics),
grounded on a dynamical game-theoretic solution concept called sink
equilibrium, for the evaluation, ranking, and computation of policies in
multi-agent learning. We adopt strict best response dynamics (SBRD) to model
selfish behaviors at a meta-level for multi-agent reinforcement learning. Our
approach can deal with dynamical cyclical behaviors (unlike approaches based on
Nash equilibria and Elo ratings), and is more compatible with single-agent
reinforcement learning than alpha-rank which relies on weakly better responses.
We first consider settings where the difference between the largest and
second-largest values of the underlying metric has a known lower bound. With
this knowledge we
propose a class of perturbed SBRD with the following property: only policies
with maximum metric are observed with nonzero probability for a broad class of
stochastic games with finite memory. We then consider settings where the lower
bound for the difference is unknown. For this setting, we propose a class of
perturbed SBRD such that the metrics of the policies observed with nonzero
probability differ from the optimal by any given tolerance. The proposed
perturbed SBRD addresses the opponent-induced non-stationarity by fixing the
strategies of others for the learning agent, and uses empirical game-theoretic
analysis to estimate payoffs for each strategy profile obtained due to the
perturbation.
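The meta-level dynamics described above can be sketched in a few lines. The following is a toy illustration only, not the paper's algorithm: the payoff tables, perturbation scheme, and all names (`perturbed_sbrd`, `strict_best_response`, the choice of two agents with three policies each) are hypothetical, and the empirical payoffs are replaced by random numbers rather than estimates from empirical game-theoretic analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical meta-game: 2 agents, each choosing among 3 policies.
# payoffs[i][a0, a1] is agent i's payoff when agent 0 plays policy a0
# and agent 1 plays policy a1 (here random, standing in for empirically
# estimated payoffs).
payoffs = [rng.standard_normal((3, 3)) for _ in range(2)]

def strict_best_response(profile, agent):
    """Return a strictly better policy for `agent`, holding the other
    agent's policy fixed, or None if the current policy is already best."""
    other = 1 - agent
    def util(a):
        a0, a1 = (a, profile[other]) if agent == 0 else (profile[other], a)
        return payoffs[agent][a0, a1]
    utilities = [util(a) for a in range(3)]
    best = int(np.argmax(utilities))
    return best if utilities[best] > util(profile[agent]) else None

def perturbed_sbrd(profile, steps=200, eps=0.05):
    """Perturbed strict best response dynamics: at each step one agent is
    selected; with probability eps it mutates to a random policy
    (the perturbation), otherwise it switches to a strict best response
    against the fixed opponent, if one exists."""
    for _ in range(steps):
        agent = int(rng.integers(2))
        if rng.random() < eps:
            new = int(rng.integers(3))
        else:
            br = strict_best_response(profile, agent)
            if br is None:
                continue
            new = br
        profile = (new, profile[1]) if agent == 0 else (profile[0], new)
    return profile

print(perturbed_sbrd((0, 0)))
```

Because each non-perturbed step fixes the opponent's strategy, the learning agent faces a stationary single-agent problem, which is the sense in which the abstract claims compatibility with single-agent reinforcement learning.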
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Preference Poisoning Attacks on Reward Model Learning [49.806139447922526]
We show how an attacker can flip a small subset of preference comparisons with the goal of either promoting or demoting a target outcome.
We find that the best attacks are often highly successful, achieving in the most extreme case 100% success rate with only 0.3% of the data poisoned.
We also show that several state-of-the-art defenses against other classes of poisoning attacks exhibit, at best, limited efficacy in our setting.
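The flipping attack summarized above can be illustrated schematically. This sketch is not the paper's attack, which chooses which comparisons to flip adversarially toward a target outcome; here the flipped subset is random, and the dataset, budget value, and function name are all hypothetical, chosen only to show what "flipping 0.3% of preference comparisons" means.

```python
import random

random.seed(0)

# Hypothetical preference dataset: (option_a, option_b, label),
# where label == 0 means option_a is preferred and 1 means option_b is.
prefs = [(f"a{i}", f"b{i}", random.randint(0, 1)) for i in range(1000)]

def flip_small_subset(prefs, budget=0.003):
    """Flip the labels of a `budget` fraction of comparisons.
    A real attack would pick the flips that most move the learned
    reward model toward the attacker's target; here they are random."""
    n_flip = max(1, int(budget * len(prefs)))
    poisoned = list(prefs)
    for i in random.sample(range(len(prefs)), n_flip):
        a, b, label = poisoned[i]
        poisoned[i] = (a, b, 1 - label)
    return poisoned

poisoned = flip_small_subset(prefs)
changed = sum(p != q for p, q in zip(prefs, poisoned))
print(changed)  # with a 0.3% budget, 3 of 1000 comparisons are flipped
```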
arXiv Detail & Related papers (2024-02-02T21:45:24Z) - Learning and Calibrating Heterogeneous Bounded Rational Market Behaviour
with Multi-Agent Reinforcement Learning [4.40301653518681]
Agent-based models (ABMs) have shown promise for modelling various real world phenomena incompatible with traditional equilibrium analysis.
Recent developments in multi-agent reinforcement learning (MARL) offer a way to address this issue from a rationality perspective.
We propose a novel technique for representing heterogeneous processing-constrained agents within a MARL framework.
arXiv Detail & Related papers (2024-02-01T17:21:45Z) - A Minimaximalist Approach to Reinforcement Learning from Human Feedback [49.45285664482369]
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback.
Our approach is minimalist in that it does not require training a reward model nor unstable adversarial training.
We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches.
arXiv Detail & Related papers (2024-01-08T17:55:02Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Discovering How Agents Learn Using Few Data [32.38609641970052]
We propose a theoretical and algorithmic framework for real-time identification of agent behavior using a short burst of a single system trajectory.
Our approach accurately recovers the true dynamics across various benchmarks, including equilibrium selection and prediction of chaotic systems up to 10 Lyapunov times.
These findings suggest that our approach has significant potential to support effective policy and decision-making in strategic multi-agent systems.
arXiv Detail & Related papers (2023-07-13T09:14:48Z) - Efficient Model-based Multi-agent Reinforcement Learning via Optimistic
Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment.
We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.