Combining Deep Reinforcement Learning and Search with Generative Models for Game-Theoretic Opponent Modeling
- URL: http://arxiv.org/abs/2302.00797v2
- Date: Fri, 13 Jun 2025 15:38:03 GMT
- Title: Combining Deep Reinforcement Learning and Search with Generative Models for Game-Theoretic Opponent Modeling
- Authors: Zun Li, Marc Lanctot, Kevin R. McKee, Luke Marris, Ian Gemp, Daniel Hennes, Paul Muller, Kate Larson, Yoram Bachrach, Michael P. Wellman
- Abstract summary: We introduce a scalable and generic multiagent training regime for opponent modeling using deep game-theoretic reinforcement learning. We first propose Generative Best Response (GenBR), a best response algorithm based on Monte-Carlo Tree Search (MCTS). We use this new method under the framework of Policy Space Response Oracles (PSRO) to automate the generation of an \emph{offline opponent model}.
- Score: 30.465929764202155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Opponent modeling methods typically involve two crucial steps: building a belief distribution over opponents' strategies, and exploiting this opponent model by playing a best response. However, existing approaches typically require domain-specific heuristics to come up with such a model, and algorithms for approximating best responses are hard to scale in large, imperfect-information domains. In this work, we introduce a scalable and generic multiagent training regime for opponent modeling using deep game-theoretic reinforcement learning. We first propose Generative Best Response (GenBR), a best response algorithm based on Monte-Carlo Tree Search (MCTS) with a learned deep generative model that samples world states during planning. This new method scales to large imperfect-information domains and can be used as a plug-and-play component in a variety of multiagent algorithms. We use this new method under the framework of Policy Space Response Oracles (PSRO) to automate the generation of an \emph{offline opponent model} via iterative game-theoretic reasoning and population-based training. We propose using solution concepts based on bargaining theory to build up an opponent mixture, which we find identifies profiles that are near the Pareto frontier. GenBR then keeps updating an \emph{online opponent model} and reacts against it during gameplay. We conduct behavioral studies in which human participants negotiate with our agents in Deal-or-No-Deal, a class of bilateral bargaining games. Search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and produces agents that, when negotiating with humans, achieve social welfare and Nash bargaining scores comparable to those of humans trading among themselves.
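The abstract's core mechanism is MCTS planning over world states drawn from a learned generative model of the opponent's hidden information. The sketch below is illustrative only, not the paper's implementation: `GenerativeBeliefModel`, `MiniDeal`, and `genbr_plan` are hypothetical names, the belief model is a simple particle set standing in for a learned deep generative model, and the game is a toy two-round bargaining problem.

```python
# Illustrative sketch only: MCTS whose simulations plan over world states
# sampled from a generative belief model, in the spirit of GenBR as
# summarized in the abstract above. All names here are hypothetical.
import math
import random
from collections import defaultdict


class GenerativeBeliefModel:
    """Stand-in for a learned deep generative model over hidden world states:
    here it simply samples an opponent acceptance threshold from a particle set."""
    def __init__(self, particles):
        self.particles = particles

    def sample(self):
        return random.choice(self.particles)


class MiniDeal:
    """Toy bargaining game: split a pie of size 10 over at most two rounds.
    The opponent accepts any offer at or above its hidden threshold."""
    PIE, MAX_ROUNDS = 10, 2
    OFFERS = (2, 4, 6)  # amount offered to the opponent

    def step(self, round_idx, offer, threshold):
        if offer >= threshold:                      # deal accepted
            return None, self.PIE - offer, True
        if round_idx + 1 >= self.MAX_ROUNDS:        # no deal reached
            return None, 0.0, True
        return round_idx + 1, 0.0, False            # try again next round


def genbr_plan(game, belief, num_simulations=2000, c_uct=1.4):
    """UCT keyed on the agent's observable state (round index); the hidden
    state is drawn from the generative belief model, one sample per simulation."""
    n = defaultdict(int)      # (round, offer) -> visit count
    q = defaultdict(float)    # (round, offer) -> mean return

    def simulate(round_idx, threshold):
        total = sum(n[(round_idx, a)] for a in game.OFFERS)
        if total == 0:
            offer = random.choice(game.OFFERS)
        else:
            offer = max(game.OFFERS, key=lambda a: q[(round_idx, a)]
                        + c_uct * math.sqrt(math.log(total + 1) / (n[(round_idx, a)] + 1e-9)))
        nxt, reward, done = game.step(round_idx, offer, threshold)
        value = reward if done else reward + simulate(nxt, threshold)
        n[(round_idx, offer)] += 1
        q[(round_idx, offer)] += (value - q[(round_idx, offer)]) / n[(round_idx, offer)]
        return value

    for _ in range(num_simulations):
        simulate(0, belief.sample())   # key idea: plan over generated world states
    return max(game.OFFERS, key=lambda a: n[(0, a)])


if __name__ == "__main__":
    belief = GenerativeBeliefModel(particles=[3, 4, 5, 6])   # uniform belief
    print("root offer:", genbr_plan(MiniDeal(), belief))
```

Each simulation determinizes the opponent's hidden threshold by sampling from the belief, then runs a UCT-style update on the agent's observable state, so the recommended root offer reflects planning averaged across the sampled world states.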
Related papers
- Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees [91.88803125231189]
Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences.
While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem.
In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game.
arXiv Detail & Related papers (2025-02-18T09:33:48Z)
- Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities [69.34646544774161]
We formulate a new variant of multi-player multi-armed bandit (MAB) model, which captures arrival of requests to each arm and the policy of allocating requests to players.
The challenge is how to design a distributed learning algorithm such that players select arms according to the optimal arm pulling profile.
We design an iterative distributed algorithm, which guarantees that players can arrive at a consensus on the optimal arm pulling profile in only M rounds.
arXiv Detail & Related papers (2024-08-20T13:57:00Z)
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO)
Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses.
With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
- Toward Optimal LLM Alignments Using Two-Player Games [86.39338084862324]
In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent.
We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents.
Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.
arXiv Detail & Related papers (2024-06-16T15:24:50Z)
- Best Response Shaping [1.0874100424278175]
LOLA and POLA agents learn reciprocity-based cooperative policies by differentiation through a few look-ahead optimization steps of their opponent.
Because they consider a few optimization steps, a learning opponent that takes many steps to optimize its return may exploit them.
In response, we introduce a novel approach, Best Response Shaping (BRS), which differentiates through an opponent approximating the best response.
arXiv Detail & Related papers (2024-04-05T22:03:35Z)
- A Minimaximalist Approach to Reinforcement Learning from Human Feedback [49.45285664482369]
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback.
Our approach is minimalist in that it does not require training a reward model nor unstable adversarial training.
We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches.
arXiv Detail & Related papers (2024-01-08T17:55:02Z)
- Nash Learning from Human Feedback [86.09617990412941]
We introduce an alternative pipeline for the fine-tuning of large language models using pairwise human feedback.
We term this approach Nash learning from human feedback (NLHF)
We present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent.
arXiv Detail & Related papers (2023-12-01T19:26:23Z)
- Population-based Evaluation in Repeated Rock-Paper-Scissors as a Benchmark for Multiagent Reinforcement Learning [14.37986882249142]
We propose a benchmark for multiagent learning based on repeated play of the simple game Rock, Paper, Scissors.
We describe metrics to measure the quality of agents based both on average returns and exploitability.
arXiv Detail & Related papers (2023-03-02T15:06:52Z)
- Double Deep Q-Learning in Opponent Modeling [0.0]
Opponent modeling is needed in multi-agent systems where secondary agents with conflicting agendas also adapt their strategies.
In this study, we simulate the main agent's and secondary agents' tactics using Double Deep Q-Networks (DDQN) with a prioritized experience replay mechanism.
Under the opponent modeling setup, a Mixture-of-Experts architecture is used to identify various opponent strategy patterns.
arXiv Detail & Related papers (2022-11-24T06:07:47Z)
- Human-AI Coordination via Human-Regularized Search and Learning [33.95649252941375]
We develop a three-step algorithm that achieves strong performance in coordinating with real humans in the Hanabi benchmark.
We first use a regularized search algorithm and behavioral cloning to produce a better human model that captures diverse skill levels.
We show that our method beats a baseline that plays a vanilla best response to behavioral cloning, evaluated by having experts play repeatedly with both agents.
arXiv Detail & Related papers (2022-10-11T03:46:12Z)
- Reinforcement Learning Agents in Colonel Blotto [0.0]
We focus on a specific instance of agent-based models, which uses reinforcement learning (RL) to train the agent how to act in its environment.
We find that the RL agent handily beats a single opponent, and still performs quite well when the number of opponents is increased.
We also analyze the RL agent and examine the strategies it has arrived at by inspecting the actions to which it assigns the highest and lowest Q-values.
arXiv Detail & Related papers (2022-04-04T16:18:01Z)
- Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment.
We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z)
- Finding General Equilibria in Many-Agent Economic Simulations Using Deep Reinforcement Learning [72.23843557783533]
We show that deep reinforcement learning can discover stable solutions that are epsilon-Nash equilibria for a meta-game over agent types.
Our approach is more flexible and does not need unrealistic assumptions, e.g., market clearing.
We demonstrate our approach in real-business-cycle models, a representative family of DGE models, with 100 worker-consumers, 10 firms, and a government that taxes and redistributes.
arXiv Detail & Related papers (2022-01-03T17:00:17Z)
- Reinforcement Learning In Two Player Zero Sum Simultaneous Action Games [0.0]
Two player zero sum simultaneous action games are common in video games, financial markets, war, business competition, and many other settings.
We introduce the fundamental concepts of reinforcement learning in two player zero sum simultaneous action games and discuss the unique challenges this type of game poses.
We introduce two novel agents that attempt to handle these challenges by using joint action Deep Q-Networks.
arXiv Detail & Related papers (2021-10-10T16:03:44Z)
- Collective eXplainable AI: Explaining Cooperative Strategies and Agent Contribution in Multiagent Reinforcement Learning with Shapley Values [68.8204255655161]
This study proposes a novel approach to explain cooperative strategies in multiagent RL using Shapley values.
Results could have implications for non-discriminatory decision making, ethical and responsible AI-derived decisions or policy making under fairness constraints.
arXiv Detail & Related papers (2021-10-04T10:28:57Z)
- Influence-based Reinforcement Learning for Intrinsically-motivated Agents [0.0]
We present an algorithmic framework of two reinforcement learning agents each with a different objective.
We introduce a novel function approximation approach to assess the influence $F$ of a certain policy on others.
Our method was evaluated on the suite of OpenAI gym tasks as well as cooperative and mixed scenarios.
arXiv Detail & Related papers (2021-08-28T05:36:10Z)
- L2E: Learning to Exploit Your Opponent [66.66334543946672]
We propose a novel Learning to Exploit framework for implicit opponent modeling.
L2E acquires the ability to exploit opponents through only a few interactions with different opponents during training.
We propose a novel opponent strategy generation algorithm that produces effective opponents for training automatically.
arXiv Detail & Related papers (2021-02-18T14:27:59Z)
- Learning to Play Sequential Games versus Unknown Opponents [93.8672371143881]
We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action.
We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents.
Our results include regret guarantees for the algorithm that depend on the regularity of the opponent's responses.
arXiv Detail & Related papers (2020-07-10T09:33:05Z)
- Learning to Model Opponent Learning [11.61673411387596]
Multi-Agent Reinforcement Learning (MARL) considers settings in which a set of coexisting agents interact with one another and their environment.
This poses a great challenge for value function-based algorithms whose convergence usually relies on the assumption of a stationary environment.
We develop a novel approach to modelling an opponent's learning dynamics which we term Learning to Model Opponent Learning (LeMOL)
arXiv Detail & Related papers (2020-06-06T17:19:04Z)