Related papers: Will GPT-4 Run DOOM?

Will GPT-4 Run DOOM?

URL: http://arxiv.org/abs/2403.05468v1
Date: Fri, 8 Mar 2024 17:30:41 GMT
Title: Will GPT-4 Run DOOM?
Authors: Adrian de Wynter
Abstract summary: We show that GPT-4's reasoning and planning capabilities extend to the 1993 first-person shooter Doom. We find that GPT-4 can play the game to a passable degree: it is able to manipulate doors, combat enemies, and perform pathing.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We show that GPT-4's reasoning and planning capabilities extend to the 1993 first-person shooter Doom. This large language model (LLM) is able to run and play the game with only a few instructions, plus a textual description--generated by the model itself from screenshots--about the state of the game being observed. We find that GPT-4 can play the game to a passable degree: it is able to manipulate doors, combat enemies, and perform pathing. More complex prompting strategies involving multiple model calls provide better results. While further work is required to enable the LLM to play the game as well as its classical, reinforcement learning-based counterparts, we note that GPT-4 required no training, leaning instead on its own reasoning and observational capabilities. We hope our work pushes the boundaries on intelligent, LLM-based agents in video games. We conclude by discussing the ethical implications of our work.

Related papers

Baba is LLM: Reasoning in a Game with Dynamic Rules [0.0]
Large language models (LLMs) are known to perform well on language tasks, but struggle with reasoning tasks.<n>This paper explores the ability of LLMs to play the 2D puzzle game Baba is You.
arXiv Detail & Related papers (2025-06-23T20:16:28Z)
lmgame-Bench: How Good are LLMs at Playing Games? [60.01834131847881]
We study the major challenges in using popular video games to evaluate modern large language model (LLM) agents.<n>We introduce lmgame-Bench to turn games into reliable evaluations.
arXiv Detail & Related papers (2025-05-21T06:02:55Z)
See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses [51.975495361024606]
We propose a Self-Challenge evaluation framework with human-in-the-loop. Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances. We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
arXiv Detail & Related papers (2024-08-16T19:01:52Z)
Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay [0.0]
We use games like Tic-Tac-Toe, Connect Four, and Battleship to assess strategic thinking and decision-making. Despite their proficiency on standard benchmarks, GPT-3.5 and GPT-4's abilities to play and reason about fully observable games without pre-training is mediocre.
arXiv Detail & Related papers (2024-07-12T14:17:26Z)
Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one of the following systematic biases.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications. This paper evaluates LLMs' reasoning abilities in competitive environments. We first propose GTBench, a language-driven environment composing 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
How FaR Are Large Language Models From Agents with Theory-of-Mind? [69.41586417697732]
We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D) T4D requires models to connect inferences about others' mental states to actions in social scenarios. We introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges.
arXiv Detail & Related papers (2023-10-04T06:47:58Z)
What's Next in Affective Modeling? Large Language Models [3.0902630634005797]
GPT-4 performs well across multiple emotion tasks. It can distinguish emotion theories and come up with emotional stories. We suggest that LLMs could play an important role in affective modeling.
arXiv Detail & Related papers (2023-10-03T16:39:20Z)
Suspicion-Agent: Playing Imperfect Information Games with Theory of Mind Aware GPT-4 [37.64921394844022]
GPT-4, the recent breakthrough in large language models (LLMs) trained on massive passive data, is notable for its knowledge retrieval and reasoning abilities. This paper delves into the applicability of GPT-4's learned knowledge for imperfect information games. We introduce Suspicion-Agent, an innovative agent that leverages GPT-4's capabilities for performing in imperfect information games.
arXiv Detail & Related papers (2023-09-29T14:30:03Z)
Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing [0.0]
This paper investigates the strategic decision-making capabilities of three Large Language Models (LLMs): GPT-3.5, GPT-4, and LLaMa-2. Utilizing four canonical two-player games, we explore how these models navigate social dilemmas.
arXiv Detail & Related papers (2023-09-12T00:54:15Z)
Playing repeated games with Large Language Models [20.63964279913456]
We use behavioral game theory to study Large Language Models' cooperation and coordination behavior. Our results show that LLMs generally perform well in such tasks and also uncover persistent behavioral signatures. These results enrich our understanding of LLM's social behavior and pave the way for a behavioral game theory for machines.
arXiv Detail & Related papers (2023-05-26T12:17:59Z)
SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM) In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment. Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)
Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data. We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more. We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.