Related papers: Human vs. Machine: Behavioral Differences Between Expert Humans and Language Models in Wargame Simulations

Human vs. Machine: Behavioral Differences Between Expert Humans and Language Models in Wargame Simulations

URL: http://arxiv.org/abs/2403.03407v2
Date: Mon, 3 Jun 2024 15:00:47 GMT
Title: Human vs. Machine: Behavioral Differences Between Expert Humans and Language Models in Wargame Simulations
Authors: Max Lamparth, Anthony Corso, Jacob Ganz, Oriana Skylar Mastro, Jacquelyn Schneider, Harold Trinkunas,
Abstract summary: We show that large language models (LLMs) behave differently compared to humans in high-stakes military decision-making scenarios. Our results motivate policymakers to be cautious before granting autonomy or following AI-based strategy recommendations.
Score: 1.6108153271585284
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: To some, the advent of artificial intelligence (AI) promises better decision-making and increased military effectiveness while reducing the influence of human error and emotions. However, there is still debate about how AI systems, especially large language models (LLMs), behave compared to humans in high-stakes military decision-making scenarios with the potential for increased risks towards escalation and unnecessary conflicts. To test this potential and scrutinize the use of LLMs for such purposes, we use a new wargame experiment with 107 national security experts designed to look at crisis escalation in a fictional US-China scenario and compare human players to LLM-simulated responses in separate simulations. Wargames have a long history in the development of military strategy and the response of nations to threats or attacks. Here, we show a considerable high-level agreement in the LLM and human responses and significant quantitative and qualitative differences in individual actions and strategic tendencies. These differences depend on intrinsic biases in LLMs regarding the appropriate level of violence following strategic instructions, the choice of LLM, and whether the LLMs are tasked to decide for a team of players directly or first to simulate dialog between players. When simulating the dialog, the discussions lack quality and maintain a farcical harmony. The LLM simulations cannot account for human player characteristics, showing no significant difference even for extreme traits, such as "pacifist" or "aggressive sociopath". Our results motivate policymakers to be cautious before granting autonomy or following AI-based strategy recommendations.

Related papers

How large language models judge and influence human cooperation [82.07571393247476]
We assess how state-of-the-art language models judge cooperative actions.<n>We observe a remarkable agreement in evaluating cooperation against good opponents.<n>We show that the differences revealed between models can significantly impact the prevalence of cooperation.
arXiv Detail & Related papers (2025-06-30T09:14:42Z)
Beyond Nash Equilibrium: Bounded Rationality of LLMs and humans in Strategic Decision-making [33.2843381902912]
Large language models are increasingly used in strategic decision-making settings.<n>We compare LLMs and humans using experimental paradigms adapted from behavioral game-theory research.
arXiv Detail & Related papers (2025-06-11T04:43:54Z)
SocialEval: Evaluating Social Intelligence of Large Language Models [70.90981021629021]
Social Intelligence (SI) equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals.<n>This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation.<n>We propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts.
arXiv Detail & Related papers (2025-06-01T08:36:51Z)
Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks [52.098988739649705]
This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater.<n>We develop a no-consensus'' benchmark by curating examples that encompass a variety of a priori ambivalent scenarios.<n>Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters.
arXiv Detail & Related papers (2025-05-28T01:31:54Z)
Humans expect rationality and cooperation from LLM opponents in strategic games [0.0]
We present the results of the first monetarily-incentivised laboratory experiment looking at differences in human behaviour.<n>We show that, in this environment, human subjects choose significantly lower numbers when playing against LLMs than humans.<n>This shift is mainly driven by subjects with high strategic reasoning ability.
arXiv Detail & Related papers (2025-05-16T09:01:09Z)
Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Tasks [6.355245936740126]
Large language models (LLMs) are increasingly used to simulate or automate human behavior in sequential decision-making tasks.<n>We focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty.<n>We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration.
arXiv Detail & Related papers (2025-05-15T02:09:18Z)
Exploring Persona-dependent LLM Alignment for the Moral Machine Experiment [23.7081830844157]
This study examines the alignment between socio-driven decisions and human judgment in various contexts of the moral machine experiment. We find that the moral decisions of LLMs vary substantially by persona, showing greater shifts in moral decisions for critical tasks than humans. We discuss the ethical implications and risks associated with deploying these models in applications that involve moral decisions.
arXiv Detail & Related papers (2025-04-15T05:29:51Z)
Measurement of LLM's Philosophies of Human Nature [113.47929131143766]
We design the standardized psychological scale specifically targeting large language models (LLM) We show that current LLMs exhibit a systemic lack of trust in humans. We propose a mental loop learning framework, which enables LLM to continuously optimize its value system.
arXiv Detail & Related papers (2025-04-03T06:22:19Z)
AI persuading AI vs AI persuading Humans: LLMs' Differential Effectiveness in Promoting Pro-Environmental Behavior [70.24245082578167]
Pro-environmental behavior (PEB) is vital to combat climate change, yet turning awareness into intention and action remains elusive. We explore large language models (LLMs) as tools to promote PEB, comparing their impact across 3,200 participants. Results reveal a "synthetic persuasion paradox": synthetic and simulated agents significantly affect their post-intervention PEB stance, while human responses barely shift.
arXiv Detail & Related papers (2025-03-03T21:40:55Z)
Can Machines Think Like Humans? A Behavioral Evaluation of LLM-Agents in Dictator Games [7.504095239018173]
Large Language Model (LLM)-based agents increasingly undertake real-world tasks and engage with human society. This study investigates how different personas and experimental framings affect these AI agents' altruistic behavior. Despite being trained on extensive human-generated data, these AI agents cannot accurately predict human decisions.
arXiv Detail & Related papers (2024-10-28T17:47:41Z)
Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina [7.155982875107922]
Studies suggest large language models (LLMs) can exhibit human-like reasoning, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates or simulations for humans in social science research. We assess the reasoning depth of LLMs using the 11-20 money request game.
arXiv Detail & Related papers (2024-10-25T14:46:07Z)
Large Language Models Reflect the Ideology of their Creators [73.25935570218375]
Large language models (LLMs) are trained on vast amounts of data to generate natural language. We uncover notable diversity in the ideological stance exhibited across different LLMs and languages.
arXiv Detail & Related papers (2024-10-24T04:02:30Z)
Who is Undercover? Guiding LLMs to Explore Multi-Perspective Team Tactic in the Game [3.8284679578037246]
We use the language logic game Who is Undercover?'' as an experimental platform to propose the Multi-Perspective Team Tactic (MPTT) framework. MPTT aims to cultivate LLMs' human-like language expression logic, multi-dimensional thinking, and self-perception in complex scenarios. Preliminary results show that MPTT, combined with WIU, leverages LLMs' cognitive capabilities to create a decision-making framework that can simulate real society.
arXiv Detail & Related papers (2024-10-20T06:41:31Z)
Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations [12.887834116390358]
We use a metric based on BERTScore to measure response inconsistency quantitatively. We show that all five tested LMs exhibit levels of inconsistency that indicate semantic differences. We recommend further consideration be taken before using LMs to inform military decisions.
arXiv Detail & Related papers (2024-10-17T04:12:17Z)
Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance [73.19687314438133]
We study how reliance is affected by contextual features of an interaction. We find that contextual characteristics significantly affect human reliance behavior. Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions.
arXiv Detail & Related papers (2024-07-10T18:00:05Z)
Nicer Than Humans: How do Large Language Models Behave in the Prisoner's Dilemma? [0.1474723404975345]
We study the cooperative behavior of Llama2 when playing the Iterated Prisoner's Dilemma against random adversaries displaying various levels of hostility. We find that Llama2 tends not to initiate defection but it adopts a cautious approach towards cooperation. In comparison to prior research on human participants, Llama2 exhibits a greater inclination towards cooperative behavior.
arXiv Detail & Related papers (2024-06-19T14:51:14Z)
ALYMPICS: LLM Agents Meet Game Theory -- Exploring Strategic Decision-Making with AI Agents [77.34720446306419]
Alympics is a systematic simulation framework utilizing Large Language Model (LLM) agents for game theory research. Alympics creates a versatile platform for studying complex game theory problems.
arXiv Detail & Related papers (2023-11-06T16:03:46Z)
Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models [105.39236338147715]
The paper is inspired by the popular language game Who is Spy'' We develop DEEP to evaluate LLMs' expression and disguising abilities. We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z)
Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation [52.930183136111864]
We propose using scorable negotiation to evaluate Large Language Models (LLMs) To reach an agreement, agents must have strong arithmetic, inference, exploration, and planning capabilities. We provide procedures to create new games and increase games' difficulty to have an evolving benchmark.
arXiv Detail & Related papers (2023-09-29T13:33:06Z)
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.