Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games
- URL: http://arxiv.org/abs/2407.04467v2
- Date: Tue, 16 Jul 2024 15:19:27 GMT
- Title: Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games
- Authors: Nathan Herr, Fernando Acero, Roberta Raileanu, María Pérez-Ortiz, Zhibin Li
- Abstract summary: Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic abilities remain largely unexplored.
We investigate LLMs' behaviour in two strategic games, Stag Hunt and Prisoner's Dilemma, analyzing performance variations under different settings and prompts.
Our results show that the tested state-of-the-art LLMs exhibit at least one of the following systematic biases: (1) positional bias, (2) payoff bias, or (3) behavioural bias.
- Score: 56.70628673595041
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic abilities remain largely unexplored. Game theory provides a good framework for assessing the decision-making abilities of LLMs in interactions with other agents. Although prior studies have shown that LLMs can solve these tasks with carefully curated prompts, they fail when the problem setting or prompt changes. In this work, we investigate LLMs' behaviour in two strategic games, Stag Hunt and Prisoner's Dilemma, analyzing performance variations under different settings and prompts. Our results show that the tested state-of-the-art LLMs exhibit at least one of the following systematic biases: (1) positional bias, (2) payoff bias, or (3) behavioural bias. We further observe that LLMs' performance drops when the game configuration is misaligned with these biases. Performance is assessed by whether the model selects the correct action, the one that agrees with the prompted preferred behaviours of both players; alignment refers to whether the LLM's bias agrees with the correct action. For example, GPT-4o's average performance drops by 34% when misaligned. Moreover, the current trend of "bigger and newer is better" does not hold here: GPT-4o, the current best-performing LLM, suffers the most substantial performance drop. Lastly, we note that while chain-of-thought prompting does reduce the effect of these biases in most models, it is far from solving the problem at a fundamental level.
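To make the evaluation setup concrete, here is a minimal sketch of a positional-bias probe in the Stag Hunt setting. The payoff values, prompt wording, and `query_llm` function are hypothetical illustrations, not the paper's actual experimental code: the idea is simply to present the same game twice with the action order swapped and check whether the model's choice, and hence its correctness, depends on ordering alone.

```python
# Minimal sketch of a positional-bias probe for Stag Hunt.
# Payoffs, prompt wording, and query_llm() are hypothetical; the paper's
# exact prompts and payoff matrices may differ.

# Stag Hunt payoffs (your payoff, other's payoff): mutual "stag" pays most,
# but "hare" is the safe choice if the other player defects.
PAYOFFS = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (2, 2),
}

# The correct action is the one agreeing with both players' prompted
# preference for mutual cooperation.
CORRECT_ACTION = "stag"

def build_prompt(actions):
    """Describe the game with the actions listed in the given order."""
    lines = [
        "You are playing a one-shot game. Both players prefer mutual cooperation.",
        f"Your options, in order: {', '.join(actions)}.",
    ]
    for (mine, theirs), (p_mine, p_theirs) in PAYOFFS.items():
        lines.append(
            f"If you play {mine} and the other player plays {theirs}, "
            f"you get {p_mine} and they get {p_theirs}."
        )
    lines.append("Answer with exactly one option.")
    return "\n".join(lines)

def positional_bias_probe(query_llm):
    """Return True if the model's choice flips when only the option order changes."""
    first = query_llm(build_prompt(["stag", "hare"])).strip().lower()
    second = query_llm(build_prompt(["hare", "stag"])).strip().lower()
    return first != second
```

A payoff-bias probe would follow the same pattern, holding the option order fixed while permuting which cell of the payoff matrix carries the largest values.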
Related papers
- Social Bias Evaluation for Large Language Models Requires Prompt Variations [38.91306092184724]
Large Language Models (LLMs) exhibit considerable social biases.
This paper investigates LLMs' sensitivity to prompt variations.
We show that LLMs exhibit tradeoffs between task performance and social bias that are induced by the prompts.
arXiv Detail & Related papers (2024-07-03T14:12:04Z) - Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better! [14.906150451947443]
We propose and study two versions of many-shot in-context prompts for helping GPT-4o-as-a-Judge in single answer grading.
Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the consistency and quality of the judgment results.
We reveal the symbol bias hidden in the pairwise comparison of GPT-4o-as-a-Judge and propose a simple yet effective approach to mitigate it.
arXiv Detail & Related papers (2024-06-17T15:11:58Z) - Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z) - Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs [6.090496490133132]
We find that involving LLMs in role-playing scenarios boosts their ability to recognize and mitigate biases.
We propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation replacing human feedback.
arXiv Detail & Related papers (2024-04-15T22:18:50Z) - How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [83.78240828340681]
This research investigates Large Language Models' decision-making capabilities through the lens of Game Theory.
We focus specifically on games that support the participation of more than two agents simultaneously.
We introduce our framework, GAMA-Bench, which includes eight classical multi-agent games.
arXiv Detail & Related papers (2024-03-18T14:04:47Z) - GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z) - Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others.
We formally define an LLM's self-bias: the tendency to favor its own generations.
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z) - Self-Supervised Position Debiasing for Large Language Models [39.261233221850155]
We propose a self-supervised position debiasing (SOD) framework to mitigate position bias for large language models (LLMs).
Experiments on eight datasets and five tasks show that SOD consistently outperforms existing methods in mitigating three types of position biases.
arXiv Detail & Related papers (2024-01-02T14:12:41Z) - Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment [82.60594940370919]
We propose the FlipFlop experiment to study the multi-turn behavior of Large Language Models (LLMs).
We show that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect).
We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate the effect, reducing performance deterioration by 60%, but cannot resolve sycophantic behavior entirely (a rough measurement sketch follows this list).
arXiv Detail & Related papers (2023-11-14T23:40:22Z)
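As a rough illustration of how the FlipFlop effect can be quantified, here is a minimal sketch; the `initial_answer` and `answer_after_challenge` callables are hypothetical placeholders for model calls, since the paper's exact protocol is not reproduced here.

```python
# Minimal sketch of measuring the FlipFlop effect: answer once, challenge
# the model (e.g. "Are you sure?"), and compare accuracy before and after.
# initial_answer() and answer_after_challenge() are hypothetical placeholders.

def flipflop_metrics(examples, initial_answer, answer_after_challenge):
    flips = correct_first = correct_final = 0
    for question, gold in examples:
        first = initial_answer(question)
        final = answer_after_challenge(question, first)
        flips += first != final
        correct_first += first == gold
        correct_final += final == gold
    n = len(examples)
    return {
        "flip_rate": flips / n,                                # paper reports ~46% on average
        "accuracy_drop": (correct_first - correct_final) / n,  # ~17% average drop
    }
```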