Related papers: FAIRGAMER: Evaluating Biases in the Application of Large Language Models to Video Games

URL: http://arxiv.org/abs/2508.17825v1
Date: Mon, 25 Aug 2025 09:26:19 GMT
Title: FAIRGAMER: Evaluating Biases in the Application of Large Language Models to Video Games
Authors: Bingkang Shi, Jen-tse Huang, Guoyi Li, Xiaodan Zhang, Zhongjiang Yao,
Abstract summary: We show that Large Language Models' inherent social biases can directly damage game balance in real-world gaming environments. We present FairGamer, the first bias evaluation benchmark for LLMs in video game scenarios.
Score: 9.989488318132539
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Leveraging their advanced capabilities, Large Language Models (LLMs) demonstrate vast application potential in video games--from dynamic scene generation and intelligent NPC interactions to adaptive opponents--replacing or enhancing traditional game mechanics. However, LLMs' trustworthiness in this application has not been sufficiently explored. In this paper, we reveal that the models' inherent social biases can directly damage game balance in real-world gaming environments. To this end, we present FairGamer, the first bias evaluation benchmark for LLMs in video game scenarios, featuring six tasks and a novel metric, $D_{lstd}$. It covers three key scenarios in games where LLMs' social biases are particularly likely to manifest: Serving as Non-Player Characters, Interacting as Competitive Opponents, and Generating Game Scenes. FairGamer utilizes both reality-grounded and fully fictional game content, covering a variety of video game genres. Experiments reveal: (1) Decision biases directly cause game balance degradation, with Grok-3 (average $D_{lstd}$ score = 0.431) exhibiting the most severe degradation; (2) LLMs demonstrate isomorphic social/cultural biases toward both real and virtual world content, suggesting that their biases may stem from inherent model characteristics. These findings expose critical reliability gaps in LLMs' gaming applications. Our code and data are available in an anonymous GitHub repository: https://github.com/Anonymous999-xxx/FairGamer

Related papers

Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory [37.51238507036326]
We use the game of Twenty Questions to evaluate the information-seeking ability of Large Language Models (LLMs). We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game.
arXiv Detail & Related papers (2026-02-02T06:33:18Z)
Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies [54.08697738311866]
Social deduction games like Werewolf combine language, reasoning, and strategy. We curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. We propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages.
arXiv Detail & Related papers (2025-10-13T13:33:30Z)
VideoGameBench: Can Vision-Language Models complete popular video games? [8.5302862604852]
Video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases. We introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real time. We show that frontier vision-language models struggle to progress beyond the beginning of each game.
arXiv Detail & Related papers (2025-05-23T17:43:27Z)
lmgame-Bench: How Good are LLMs at Playing Games? [60.01834131847881]
We study the major challenges in using popular video games to evaluate modern large language model (LLM) agents. We introduce lmgame-Bench to turn games into reliable evaluations.
arXiv Detail & Related papers (2025-05-21T06:02:55Z)
ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition [14.753916893216129]
ZeroSumEval is a novel competition-based evaluation protocol that leverages zero-sum games to assess Large Language Models (LLMs). ZeroSumEval encompasses a diverse suite of games, including security challenges (PyJail), classic games (Chess, Liar's Dice, Poker), knowledge tests (MathQuiz), and persuasion challenges (Gandalf, Debate).
arXiv Detail & Related papers (2025-04-17T01:23:50Z)
Can Large Language Models Capture Video Game Engagement? [1.3873323883842132]
We evaluate comprehensively the capacity of popular Large Language Models to annotate and successfully predict continuous affect annotations of videos. We run over 2,400 experiments to investigate the impact of LLM architecture, model size, input modality, prompting strategy, and ground truth processing method on engagement prediction.
arXiv Detail & Related papers (2025-02-05T17:14:47Z)
GAMEBoT: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z)
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters [97.11173801187816]
Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content. This paper critically examines gender biases in LLM-generated reference letters.
arXiv Detail & Related papers (2023-10-13T16:12:57Z)
GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs). GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms. We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z)
Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models [68.85478477006178]
We present a Promptable Game Model (PGM) for neural video game simulators. It allows a user to play the game by prompting it with high- and low-level action sequences. Most captivatingly, our PGM unlocks the director's mode, where the game is played by specifying goals for the agents in the form of a prompt. Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art.
arXiv Detail & Related papers (2023-03-23T17:43:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.