Related papers: Style-Preserving Policy Optimization for Game Agents

Related papers

Learning Controllable and Diverse Player Behaviors in Multi-Agent Environments [0.0]
This paper introduces a reinforcement learning framework that enables controllable and diverse player behaviors without relying on human gameplay data.<n>We define player behavior in an N-dimensional continuous space and uniformly sample target behavior vectors from a region that encompasses a subset representing real human styles.<n>A single PPO-based multi-agent policy can reproduce new or unseen play styles without retraining.
arXiv Detail & Related papers (2025-12-11T17:26:24Z)
Robust and Diverse Multi-Agent Learning via Rational Policy Gradient [14.745600697014753]
We develop a formalism for adversarial optimization that avoids self-sabotage by ensuring agents remain rational.<n> Rational Policy Gradient (RPG) trains agents to maximize their own reward in a modified version of the original game in which we use opponent shaping techniques to optimize the adversarial objective.<n>RPG enables us to extend a variety of existing adversarial optimization algorithms that, no longer subject to the limitations of self-sabotage, can find adversarial examples, improve robustness and adaptability, and learn diverse policies.
arXiv Detail & Related papers (2025-11-12T18:34:11Z)
Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition [52.232968183793986]
General Policy Composition (GPC) is a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies.<n>GPC consistently improves performance and adaptability across a diverse set of tasks.
arXiv Detail & Related papers (2025-10-01T16:05:53Z)
Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces [57.466101098183884]
Reinforcement learning (RL) struggles to scale to large, action spaces common in many real-world problems.<n>This paper introduces a novel framework for training discrete diffusion models as highly effective policies in complex settings.
arXiv Detail & Related papers (2025-09-26T21:53:36Z)
Mxplainer: Explain and Learn Insights by Imitating Mahjong Agents [0.8088999193162028]
This paper introduces Mxplainer, a parameterized search algorithm that can be converted into an equivalent neural network to learn the parameters of black-box agents.<n>Experiments conducted on AI and human player data demonstrate that the learned parameters provide human-understandable insights into these agents' characteristics and play styles.
arXiv Detail & Related papers (2025-06-17T07:07:13Z)
Preference-based opponent shaping in differentiable games [3.373994463906893]
We propose a novel Preference-based Opponent Shaping (PBOS) method to enhance the strategy learning process by shaping agents' preferences towards cooperation.<n>We verify the performance of PBOS algorithm in a variety of differentiable games.
arXiv Detail & Related papers (2024-12-04T06:49:21Z)
From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.<n>We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
Toward Optimal LLM Alignments Using Two-Player Games [86.39338084862324]
In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent. We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents. Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.
arXiv Detail & Related papers (2024-06-16T15:24:50Z)
Reinforcement Learning for High-Level Strategic Control in Tower Defense Games [47.618236610219554]
In strategy games, one of the most important aspects of game design is maintaining a sense of challenge for players. We propose an automated approach that combines traditional scripted methods with reinforcement learning. Results show that combining a learned approach, such as reinforcement learning, with a scripted AI produces a higher-performing and more robust agent than using only AI.
arXiv Detail & Related papers (2024-06-12T08:06:31Z)
Best Response Shaping [1.0874100424278175]
LOLA and POLA agents learn reciprocity-based cooperative policies by differentiation through a few look-ahead optimization steps of their opponent. Because they consider a few optimization steps, a learning opponent that takes many steps to optimize its return may exploit them. In response, we introduce a novel approach, Best Response Shaping (BRS), which differentiates through an opponent approxing the best response.
arXiv Detail & Related papers (2024-04-05T22:03:35Z)
DanZero+: Dominating the GuanDan Game through Reinforcement Learning [95.90682269990705]
We develop an AI program for an exceptionally complex and popular card game called GuanDan. We first put forward an AI program named DanZero for this game. In order to further enhance the AI's capabilities, we apply policy-based reinforcement learning algorithm to GuanDan.
arXiv Detail & Related papers (2023-12-05T08:07:32Z)
Learning Diverse Risk Preferences in Population-based Self-play [23.07952140353786]
Current self-play algorithms optimize the agent to maximize expected win-rates against its current or historical copies. We introduce diversity from the perspective that agents could have diverse risk preferences in the face of uncertainty. We show that our method achieves comparable or superior performance in competitive games.
arXiv Detail & Related papers (2023-05-19T06:56:02Z)
Building a 3-Player Mahjong AI using Deep Reinforcement Learning [9.603486077267693]
We present Meowjong, an AI for Sanma using deep reinforcement learning. Meowjong's models achieve test accuracies comparable with AIs for 4-player Mahjong. Being the first ever AI in Sanma, we claim that Meowjong stands as a state-of-the-art in this game.
arXiv Detail & Related papers (2022-02-25T17:41:43Z)
A Ranking Game for Imitation Learning [22.028680861819215]
We treat imitation as a two-player ranking-based Stackelberg game between a $textitpolicy$ and a $textitreward$ function. This game encompasses a large subset of both inverse reinforcement learning (IRL) methods and methods which learn from offline preferences. We theoretically analyze the requirements of the loss function used for ranking policy performances to facilitate near-optimal imitation learning at equilibrium.
arXiv Detail & Related papers (2022-02-07T19:38:22Z)
Learning Meta Representations for Agents in Multi-Agent Reinforcement Learning [12.170248966278281]
In multi-agent reinforcement learning, behaviors that agents learn in a single Markov Game (MG) are typically confined to the given agent number. In this work, our focus is on creating agents that can generalize across population-varying MGs. Instead of learning a unimodal policy, each agent learns a policy set comprising effective strategies across a variety of games.
arXiv Detail & Related papers (2021-08-30T04:30:53Z)
Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods. We show that our methods perform as well or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z)
Policy Fusion for Adaptive and Customizable Reinforcement Learning Agents [137.86426963572214]
We show how to combine distinct behavioral policies to obtain a meaningful "fusion" policy. We propose four different policy fusion methods for combining pre-trained policies. We provide several practical examples and use-cases for how these methods are indeed useful for video game production and designers.
arXiv Detail & Related papers (2021-04-21T16:08:44Z)
Portfolio Search and Optimization for General Strategy Game-Playing [58.896302717975445]
We propose a new algorithm for optimization and action-selection based on the Rolling Horizon Evolutionary Algorithm. For the optimization of the agents' parameters and portfolio sets we study the use of the N-tuple Bandit Evolutionary Algorithm. An analysis of the agents' performance shows that the proposed algorithm generalizes well to all game-modes and is able to outperform other portfolio methods.
arXiv Detail & Related papers (2021-04-21T09:28:28Z)
Generating Diverse and Competitive Play-Styles for Strategy Games [58.896302717975445]
We propose Portfolio Monte Carlo Tree Search with Progressive Unpruning for playing a turn-based strategy game (Tribes) We show how it can be parameterized so a quality-diversity algorithm (MAP-Elites) is used to achieve different play-styles while keeping a competitive level of play. Our results show that this algorithm is capable of achieving these goals even for an extensive collection of game levels beyond those used for training.
arXiv Detail & Related papers (2021-04-17T20:33:24Z)
On the Power of Refined Skat Selection [1.3706331473063877]
Skat is a fascinating card game, show-casing many of the intrinsic challenges for modern AI systems. We propose hard expert-rules and a scoring function based on refined skat evaluation features. Experiments emphasize the impact of the refined skat putting algorithm on the playing performance of the bots.
arXiv Detail & Related papers (2021-04-07T08:54:58Z)
Forgetful Experience Replay in Hierarchical Reinforcement Learning from Demonstrations [55.41644538483948]
In this paper, we propose a combination of approaches that allow the agent to use low-quality demonstrations in complex vision-based environments. Our proposed goal-oriented structuring of replay buffer allows the agent to automatically highlight sub-goals for solving complex hierarchical tasks in demonstrations. The solution based on our algorithm beats all the solutions for the famous MineRL competition and allows the agent to mine a diamond in the Minecraft environment.
arXiv Detail & Related papers (2020-06-17T15:38:40Z)
Enhanced Rolling Horizon Evolution Algorithm with Opponent Model Learning: Results for the Fighting Game AI Competition [9.75720700239984]
We propose a novel algorithm that combines Rolling Horizon Evolution Algorithm (RHEA) with opponent model learning. Our proposed bot with the policy-gradient-based opponent model is the only one without using Monte-Carlo Tree Search (MCTS) among top five bots in the 2019 competition.
arXiv Detail & Related papers (2020-03-31T04:44:33Z)
Suphx: Mastering Mahjong with Deep Reinforcement Learning [114.68233321904623]
We design an AI for Mahjong, named Suphx, based on deep reinforcement learning with some newly introduced techniques. Suphx has demonstrated stronger performance than most top human players in terms of stable rank. This is the first time that a computer program outperforms most top human players in Mahjong.
arXiv Detail & Related papers (2020-03-30T16:18:16Z)
Efficient exploration of zero-sum stochastic games [83.28949556413717]
We investigate the increasingly important and common game-solving setting where we do not have an explicit description of the game but only oracle access to it through gameplay. During a limited-duration learning phase, the algorithm can control the actions of both players in order to try to learn the game and how to play it well. Our motivation is to quickly learn strategies that have low exploitability in situations where evaluating the payoffs of a queried strategy profile is costly.
arXiv Detail & Related papers (2020-02-24T20:30:38Z)
Provable Self-Play Algorithms for Competitive Reinforcement Learning [48.12602400021397]
We study self-play in competitive reinforcement learning under the setting of Markov games. We show that a self-play algorithm achieves regret $tildemathcalO(sqrtT)$ after playing $T$ steps of the game. We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret $tildemathcalO(T2/3)$, but is guaranteed to run in time even in the worst case.
arXiv Detail & Related papers (2020-02-10T18:44:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.