Self-adaptive PSRO: Towards an Automatic Population-based Game Solver
- URL: http://arxiv.org/abs/2404.11144v1
- Date: Wed, 17 Apr 2024 07:40:57 GMT
- Title: Self-adaptive PSRO: Towards an Automatic Population-based Game Solver
- Authors: Pengdeng Li, Shuxin Li, Chang Yang, Xinrun Wang, Xiao Huang, Hau Chan, Bo An
- Abstract summary: Policy-Space Response Oracles (PSRO), as a general algorithmic framework, has achieved state-of-the-art performance in learning equilibrium policies of two-player zero-sum games.
We make the first attempt to investigate the possibility of self-adaptively determining the optimal hyperparameter values in the PSRO framework.
Experiments on various two-player zero-sum games demonstrate the superiority of SPSRO over different baselines.
- Score: 34.326819257554874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy-Space Response Oracles (PSRO), as a general algorithmic framework, has achieved state-of-the-art performance in learning equilibrium policies of two-player zero-sum games. However, the hand-crafted hyperparameter value selection in most existing works requires extensive domain knowledge, forming the main barrier to applying PSRO to different games. In this work, we make the first attempt to investigate the possibility of self-adaptively determining the optimal hyperparameter values in the PSRO framework. Our contributions are three-fold: (1) Using several hyperparameters, we propose a parametric PSRO that unifies gradient descent ascent (GDA) and different PSRO variants. (2) We propose the self-adaptive PSRO (SPSRO) by casting the hyperparameter value selection of the parametric PSRO as a hyperparameter optimization (HPO) problem, where our objective is to learn an HPO policy that can self-adaptively determine the optimal hyperparameter values during the running of the parametric PSRO. (3) To overcome the poor performance of online HPO methods, we propose a novel offline HPO approach that optimizes the HPO policy using a Transformer architecture. Experiments on various two-player zero-sum games demonstrate the superiority of SPSRO over different baselines.
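To make the loop concrete, here is a minimal sketch of the control flow the abstract describes: a parametric PSRO whose per-iteration hyperparameter values are chosen by a learned HPO policy. Every interface here (`game`, `hpo_policy.suggest`, `solve_meta_game`, `approximate_best_response`) is a hypothetical placeholder, not the authors' implementation.

```python
def self_adaptive_psro(game, hpo_policy, solve_meta_game, iterations=50):
    """Illustrative SPSRO loop (hypothetical interfaces, not the paper's code).

    Each iteration: solve the restricted meta-game, ask the HPO policy for
    this iteration's hyperparameter values, train a best response with them,
    and feed the resulting exploitability back as a learning signal.
    """
    population = [game.random_policy()]
    history = []  # (hyperparameters, exploitability) pairs observed so far
    for _ in range(iterations):
        payoffs = game.empirical_payoff_matrix(population)
        meta_strategy = solve_meta_game(payoffs)     # e.g. Nash of the meta-game
        theta = hpo_policy.suggest(history)          # self-adaptive hyperparameters
        new_policy = game.approximate_best_response(meta_strategy, population, **theta)
        population.append(new_policy)
        expl = game.exploitability(meta_strategy, population)
        history.append((theta, expl))                # feedback for the HPO policy
    return population, meta_strategy
```

The key design choice is that the HPO policy conditions on the trajectory of past (hyperparameters, exploitability) pairs, which is what makes an offline, Transformer-based optimizer a natural fit for learning it.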
Related papers
- Hierarchical Preference Optimization: Learning to achieve goals via feasible subgoals prediction [71.81851971324187]
This work introduces Hierarchical Preference Optimization (HPO), a novel approach to hierarchical reinforcement learning (HRL).
HPO addresses non-stationarity and infeasible subgoal generation issues when solving complex robotic control tasks.
Experiments on challenging robotic navigation and manipulation tasks demonstrate impressive performance of HPO, where it shows an improvement of up to 35% over the baselines.
arXiv Detail & Related papers (2024-11-01T04:58:40Z) - ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning [42.33815055388433]
ARLBench is a benchmark for hyperparameter optimization (HPO) in reinforcement learning (RL).
It allows comparisons of diverse HPO approaches while being highly efficient in evaluation.
ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL.
arXiv Detail & Related papers (2024-09-27T15:22:28Z) - AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation [37.456499537121886]
Recent advancements in Large Language Models have transformed ML/AI development, motivating a rethinking of AutoML principles for Retrieval-Augmented Generation (RAG) systems.
arXiv Detail & Related papers (2024-06-27T15:18:21Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
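As a rough illustration of how a DRO-style term can make a preference loss adaptive (a sketch under our own assumptions, not the paper's exact objective), one can reweight a standard Bradley-Terry loss so that harder preference pairs receive more weight; `tau` is an illustrative temperature.

```python
import torch
import torch.nn.functional as F

def dro_weighted_preference_loss(r_chosen, r_rejected, tau=1.0):
    """Bradley-Terry preference loss with a DRO-flavoured softmax reweighting
    over the batch (an illustrative sketch, not the paper's exact loss)."""
    per_pair = -F.logsigmoid(r_chosen - r_rejected)        # standard pairwise loss
    # Distributionally robust flavour: upweight pairs the model finds hard.
    weights = torch.softmax(per_pair.detach() / tau, dim=0)
    return (weights * per_pair).sum()
```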
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization [57.450449884166346]
We propose an adaptive HPO method to account for the privacy cost of HPO.
We obtain state-of-the-art performance on 22 benchmark tasks spanning computer vision and natural language processing, and both pretraining and finetuning.
arXiv Detail & Related papers (2022-12-08T18:56:37Z) - Feasible Adversarial Robust Reinforcement Learning for Underspecified
Environments [11.866835246140647]
In real-world environments, choosing the set of possible environment parameter values for robust reinforcement learning can be a difficult task.
We propose Feasible Adversarial Robust RL (FARR), a method for automatically determining the set of environment parameter values over which to be robust.
Using the PSRO algorithm to find an approximate Nash equilibrium in this FARR game, we show that an agent trained with FARR is more robust to feasible adversarial parameter selection than with existing minimax, domain-randomization, and regret objectives.
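A minimal sketch of the feasibility idea, with hypothetical names and penalty form: the adversary may only profit from environment parameters under which some policy can still achieve a threshold return, which rules out degenerate "unwinnable" settings.

```python
def farr_adversary_value(theta, agent_return, best_achievable_return, threshold,
                         infeasible_penalty=-1e9):
    """Illustrative FARR-style adversary objective (hypothetical form).

    In the zero-sum game, the adversary wants to minimize the agent's return,
    but only over environment parameters `theta` that are feasible: some
    policy must be able to achieve at least `threshold` return under theta.
    """
    if best_achievable_return(theta) < threshold:
        return infeasible_penalty        # infeasible theta: adversary gains nothing
    return -agent_return(theta)          # zero-sum payoff on feasible theta
```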
arXiv Detail & Related papers (2022-07-19T23:57:51Z) - Towards Learning Universal Hyperparameter Optimizers with Transformers [57.35920571605559]
We introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction.
Our experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, and that its performance can be further improved via its function uncertainty estimates.
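To give a flavor of a text-based HPO interface (a simplified sketch, not OptFormer's actual serialization scheme), an optimization trajectory can be flattened into a token sequence that a Transformer consumes to propose the next trial and predict its objective:

```python
def serialize_hpo_trajectory(trials):
    """Flatten an HPO history into a text sequence for a sequence model.

    `trials` is a list of (params: dict, objective: float) pairs. The exact
    serialization scheme here is illustrative, not OptFormer's real format.
    """
    pieces = []
    for params, objective in trials:
        for name, value in sorted(params.items()):
            pieces.append(f"{name}={value}")
        pieces.append(f"objective={objective:.4f}")
        pieces.append("<sep>")
    return " ".join(pieces)

# The model would be prompted with this string and decode the next
# "name=value" tokens as its suggested hyperparameters.
print(serialize_hpo_trajectory([({"lr": 1e-3, "batch": 64}, 0.87)]))
```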
arXiv Detail & Related papers (2022-05-26T12:51:32Z) - Efficient Policy Space Response Oracles [61.71849698253696]
The Policy Space Response Oracles (PSRO) method provides a general solution for computing Nash equilibria in two-player zero-sum games.
Central to our development is the newly introduced minimax optimization over unrestricted-restricted (URR) games.
We report a 50x speedup in wall-time, 10x data efficiency, and exploitability comparable to existing PSRO methods on Kuhn and Leduc Poker games.
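For reference, the exploitability that PSRO-style methods drive toward zero has a closed form in a two-player zero-sum matrix game; a minimal NumPy version:

```python
import numpy as np

def exploitability(A, x, y):
    """NashConv/exploitability of a strategy profile (x, y) in a zero-sum
    matrix game where the row player maximizes x^T A y.

    It sums both players' best-response gains and equals zero exactly
    at a Nash equilibrium.
    """
    row_br_value = np.max(A @ y)     # best the row player could do against y
    col_br_value = np.min(x @ A)     # best the column player could do against x
    return row_br_value - col_br_value

# Matching pennies: the uniform profile is a Nash equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(exploitability(A, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0
```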
arXiv Detail & Related papers (2022-01-28T17:54:45Z) - Towards Hyperparameter-free Policy Selection for Offline Reinforcement
Learning [10.457660611114457]
We show how to select between policies and value functions produced by different training algorithms in offline reinforcement learning.
We use BVFT [XJ21], a recent theoretical advance in value-function selection, and demonstrate its effectiveness in discrete-action benchmarks such as Atari.
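At a high level, the selection protocol can be sketched as a worst-case tournament (hedging: BVFT's pairwise projected-Bellman loss, the actual technical core, is assumed given here as `pairwise_loss`):

```python
import numpy as np

def tournament_select(candidates, pairwise_loss):
    """High-level sketch of BVFT-style selection. The pairwise loss itself
    (BVFT's projected Bellman error over joint discretizations) is assumed
    given; computing it is where the real work lives."""
    n = len(candidates)
    losses = np.array([[pairwise_loss(candidates[i], candidates[j])
                        for j in range(n)] for i in range(n)])
    scores = losses.max(axis=1)     # worst case over all comparator functions
    return int(np.argmin(scores))   # candidate with the best worst case wins
```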
arXiv Detail & Related papers (2021-10-26T20:12:11Z)