How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
- URL: http://arxiv.org/abs/2403.11807v4
- Date: Mon, 30 Sep 2024 20:57:58 GMT
- Title: How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
- Authors: Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael R. Lyu
- Abstract summary: We introduce GAMA($\gamma$)-Bench, a new framework for evaluating Large Language Models' Gaming Ability in Multi-Agent environments.
$\gamma$-Bench includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to assess LLMs' performance.
Results indicate GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought.
- Score: 83.78240828340681
- License:
- Abstract: Decision-making is a complex process requiring diverse abilities, making it an excellent framework for evaluating Large Language Models (LLMs). Researchers have examined LLMs' decision-making through the lens of Game Theory. However, existing evaluations mainly focus on two-player scenarios where an LLM competes against another. Additionally, previous benchmarks suffer from test-set leakage due to their static design. We introduce GAMA($\gamma$)-Bench, a new framework for evaluating LLMs' Gaming Ability in Multi-Agent environments. It includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to quantitatively assess LLMs' performance. $\gamma$-Bench allows flexible game settings and adapts the scoring system to different game parameters, enabling comprehensive evaluation of robustness, generalizability, and strategies for improvement. Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate twelve LLMs from six model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms the others, scoring $68.1$ out of $100$, followed by LLaMA-3.1-70B ($64.5$) and Mixtral-8x22B ($61.4$). All code and experimental results are publicly available via https://github.com/CUHK-ARISE/GAMABench.
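The abstract does not spell out the scoring mechanics, so the following is only a minimal sketch of the general idea: several agents play one round of a classic multi-agent game ("Guess 2/3 of the Average") and each receives a score normalized to $[0, 100]$. The game choice, the distance-based scoring rule, and the agent stubs are illustrative assumptions, not $\gamma$-Bench's actual implementation (see the GitHub repository above for that).

```python
# Minimal sketch of one multi-agent round with a normalized score.
# The distance-based scoring rule below is an illustrative placeholder,
# not gamma-Bench's actual scheme.
import random

def play_round(agents, ratio=2/3, lo=0.0, hi=100.0):
    """Each agent proposes a number; the target is ratio * average."""
    guesses = {name: agent(lo, hi) for name, agent in agents.items()}
    target = ratio * sum(guesses.values()) / len(guesses)
    # Score by closeness to the target, scaled to [0, 100].
    return {name: 100.0 * (1.0 - abs(g - target) / (hi - lo))
            for name, g in guesses.items()}

def random_agent(lo, hi):
    return random.uniform(lo, hi)

def llm_agent(lo, hi):
    # Placeholder: in the real benchmark this would prompt an LLM for its guess.
    return 0.66 * (lo + hi) / 2

scores = play_round({"llm": llm_agent,
                     "baseline-1": random_agent,
                     "baseline-2": random_agent})
print(scores)
```

In the benchmark itself, `llm_agent` would be replaced by a prompted model, and the scoring would adapt to each game's parameters rather than using this fixed distance rule.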
Related papers
- LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions.
To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline.
Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
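As a rough illustration of what a decompose-critique-refine loop looks like (the actual DeCRIM prompts, decomposition format, and stopping criterion are not given in this summary), consider the sketch below; `call_llm` is a hypothetical stand-in for any chat-completion call.

```python
# Hedged sketch of a Decompose-Critique-Refine style self-correction loop.
# `call_llm` is a hypothetical wrapper around any chat-completion API; the
# prompts and the "no violations" convention are illustrative, not DeCRIM's.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM client here")

def decrim(instruction: str, max_rounds: int = 3) -> str:
    # Decompose: turn the instruction into an explicit constraint list.
    constraints = call_llm(
        f"List every constraint in this instruction, one per line:\n{instruction}"
    ).splitlines()
    response = call_llm(instruction)
    for _ in range(max_rounds):
        # Critique: check the response against each constraint.
        critique = call_llm(
            "Check the response against each constraint and report violations.\n"
            "Constraints:\n" + "\n".join(constraints) +
            f"\nResponse:\n{response}"
        )
        if "no violations" in critique.lower():
            break
        # Refine: rewrite the response to address the reported violations.
        response = call_llm(
            f"Instruction:\n{instruction}\nCurrent response:\n{response}\n"
            f"Fix these issues:\n{critique}"
        )
    return response
```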
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
- LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity [7.945893812374361]
We introduce the focal diversity metric to capture the diversity-performance correlation among component LLMs of an ensemble.
We develop a diversity-optimized ensemble pruning algorithm to select the top-k sub-ensembles from a pool of $N$ base LLMs.
Our pruning method recommends top-performing LLM sub-ensembles of size $S$, often much smaller than $N$.
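The focal diversity metric is not defined in this summary, so the sketch below substitutes simple pairwise disagreement on a development set to show the shape of the pruning step: enumerate size-$S$ sub-ensembles of the $N$ base models and keep the top $k$ by diversity. The function names and scoring rule are assumptions, not LLM-TOPLA's algorithm.

```python
# Hedged sketch of diversity-based ensemble pruning: rank all size-S
# sub-ensembles of N base LLMs by a diversity score and keep the top k.
# Pairwise disagreement is a stand-in for LLM-TOPLA's focal diversity metric.
from itertools import combinations

def disagreement(preds_a, preds_b):
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def prune_ensembles(model_preds, subset_size, top_k):
    """model_preds maps model name -> predictions on a shared dev set."""
    scored = []
    for subset in combinations(model_preds, subset_size):
        div = sum(disagreement(model_preds[a], model_preds[b])
                  for a, b in combinations(subset, 2))
        scored.append((div, subset))
    scored.sort(reverse=True)                 # most diverse first
    return [subset for _, subset in scored[:top_k]]

preds = {"m1": [1, 0, 1, 1], "m2": [1, 1, 1, 0],
         "m3": [0, 0, 1, 1], "m4": [1, 0, 0, 1]}
print(prune_ensembles(preds, subset_size=2, top_k=3))
```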
arXiv Detail & Related papers (2024-10-04T22:31:15Z)
- MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs [21.689490112983677]
We introduce MetaLLM, a framework that dynamically routes each query to the optimal large language models (LLMs) for classification tasks.
By framing the selection problem as a multi-armed bandit, MetaLLM balances prediction accuracy and cost efficiency under uncertainty.
Our experiments, conducted on popular LLM platforms, showcase MetaLLM's efficacy in real-world scenarios.
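As an illustration of the multi-armed-bandit framing (MetaLLM's actual algorithm, reward definition, and cost model are not given here), an epsilon-greedy router over a reward that trades accuracy against per-call cost might look like the sketch below; the model names and costs are hypothetical.

```python
# Hedged sketch of bandit-style LLM routing: each arm is a model, the reward
# trades off correctness against cost. Epsilon-greedy stands in for whatever
# bandit algorithm MetaLLM actually uses; costs and names are made up.
import random

class BanditRouter:
    def __init__(self, costs, cost_weight=0.5, epsilon=0.1):
        self.costs = costs                      # model -> assumed $ per call
        self.cost_weight = cost_weight
        self.epsilon = epsilon
        self.totals = {m: 0.0 for m in costs}   # cumulative reward per arm
        self.counts = {m: 0 for m in costs}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.costs))
        return max(self.costs,
                   key=lambda m: self.totals[m] / max(self.counts[m], 1))

    def update(self, model, correct):
        reward = float(correct) - self.cost_weight * self.costs[model]
        self.counts[model] += 1
        self.totals[model] += reward

router = BanditRouter({"small-llm": 0.001, "large-llm": 0.03})
arm = router.choose()              # pick a model for this query
router.update(arm, correct=True)   # feed back whether its answer was right
```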
arXiv Detail & Related papers (2024-07-15T15:45:07Z)
- Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard [0.0]
We introduce a novel benchmark for large language models (LLMs) through grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku.
The open-source game simulation code, available on GitHub, allows LLMs to compete against one another and generates detailed data files.
We present the results of games among leading LLMs, including Claude 3.5 Sonnet and Claude 3 Sonnet by Anthropic, Gemini 1.5 Pro and Gemini Flash by Google, GPT-4 Turbo and GPT-4o by OpenAI, and Llama3-70B by Meta.
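A minimal game-loop harness in this spirit might look like the following sketch, in which two pluggable players alternate Tic-Tac-Toe moves and the harness records the outcome; in the benchmark the players would be prompted LLMs, whereas here they are random stubs.

```python
# Minimal sketch of a grid-game harness: two players (stubbed here; in the
# benchmark these would be LLM calls) alternate Tic-Tac-Toe moves and the
# harness reports the outcome.
import random

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WINS:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_player(board, mark):
    # Placeholder policy; a real harness would prompt an LLM with the board.
    return random.choice([i for i, cell in enumerate(board) if cell == " "])

def play(player_x, player_o):
    board = [" "] * 9
    players = {"X": player_x, "O": player_o}
    for turn in range(9):
        mark = "X" if turn % 2 == 0 else "O"
        board[players[mark](board, mark)] = mark
        if winner(board):
            return mark
    return "draw"

print(play(random_player, random_player))
```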
arXiv Detail & Related papers (2024-07-10T16:14:34Z)
- Can Large Language Models Play Games? A Case Study of A Self-Play Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions.
This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
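To make the MCTS self-play idea concrete, here is a hedged sketch of plain UCT with random rollouts on a toy zero-sum game (Nim: take 1-3 stones, whoever takes the last stone wins); the paper additionally involves LLMs, which this standalone sketch does not model.

```python
# Hedged sketch of MCTS self-play on a toy zero-sum game (Nim: take 1-3
# stones; whoever takes the last stone wins). Plain UCT with random rollouts
# stands in here for the paper's LLM-augmented variant.
import math, random

def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones, self.player = stones, player       # player to move: +1 / -1
        self.parent, self.move = parent, move
        self.children, self.visits, self.value = [], 0, 0.0
        self.untried = legal_moves(stones)

    def ucb_child(self, c=1.4):
        return max(self.children, key=lambda n: n.value / n.visits
                   + c * math.sqrt(math.log(self.visits) / n.visits))

def rollout(stones, player):
    # Random playout; the player who takes the last stone wins.
    while True:
        stones -= random.choice(legal_moves(stones))
        if stones == 0:
            return player
        player = -player

def mcts(stones, player, iters=2000):
    root = Node(stones, player)
    for _ in range(iters):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal.
        while not node.untried and node.children:
            node = node.ucb_child()
        # 2. Expansion.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, -node.player, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation (terminal nodes skip straight to the known winner).
        winner = -node.player if node.stones == 0 else rollout(node.stones, node.player)
        # 4. Backpropagation: credit wins to the player who moved into each node.
        while node is not None:
            node.visits += 1
            node.value += 1.0 if winner == -node.player else 0.0
            node = node.parent
    return max(root.children, key=lambda n: n.visits).move

print(mcts(stones=10, player=+1))   # suggested opening move for the first player
```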
arXiv Detail & Related papers (2024-03-08T19:16:29Z)
- GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)