Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs
- URL: http://arxiv.org/abs/2406.04482v1
- Date: Thu, 6 Jun 2024 20:11:08 GMT
- Title: Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs
- Authors: Claire Jin, Sudha Rao, Xiangyu Peng, Portia Botchway, Jessica Quaye, Chris Brockett, Bill Dolan
- Abstract summary: Large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and NPCs.
LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs.
We propose a systematic LLM-based method for automatically identifying such bugs from player game logs.
- Score: 17.84810486479385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and non-player characters (NPCs). However, LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs. Automated techniques for detecting such game bugs are still lacking. To address this, we propose a systematic LLM-based method for automatically identifying such bugs from player game logs, eliminating the need for collecting additional data such as post-play surveys. Applied to the text-based game DejaBoom!, our approach effectively identifies bugs inherent in LLM-powered interactive games, surpassing unstructured LLM-powered bug-catching methods and filling the gap in automated detection of logical and design flaws.
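As a rough illustration of the kind of pipeline the abstract describes (not the authors' actual implementation), the sketch below chunks a player game log into windows and asks an LLM to flag logical inconsistencies against a short design summary. The `call_llm` callable, the prompt wording, and the window size are all placeholders/assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder for any chat-completion backend (hosted API or local model).
LLMFn = Callable[[str], str]

@dataclass
class BugReport:
    turn_range: str   # which part of the log the suspected bug occurs in
    description: str  # the model's explanation of the inconsistency

PROMPT_TEMPLATE = """You are reviewing a log from an LLM-powered text adventure.
Game design summary:
{design}

Log excerpt (turns {start}-{end}):
{excerpt}

List any logical inconsistencies, hallucinated objects or events, forgotten state,
or deviations from the intended design. Answer "NONE" if the excerpt looks fine."""

def detect_bugs(log_turns: List[str], design: str, call_llm: LLMFn,
                window: int = 10) -> List[BugReport]:
    """Scan a player log in fixed-size windows and collect suspected bugs."""
    reports: List[BugReport] = []
    for start in range(0, len(log_turns), window):
        excerpt = "\n".join(log_turns[start:start + window])
        prompt = PROMPT_TEMPLATE.format(design=design, start=start,
                                        end=start + window - 1, excerpt=excerpt)
        answer = call_llm(prompt).strip()
        if answer and answer.upper() != "NONE":
            reports.append(BugReport(f"{start}-{start + window - 1}", answer))
    return reports
```

In practice one would plug a real chat model into `call_llm` and likely add structured output parsing and deduplication of reports across windows; this sketch only conveys the log-scanning idea.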
Related papers
- Cardiverse: Harnessing LLMs for Novel Card Game Prototyping [9.435009911810955]
Card games require extensive human effort in creative ideation and gameplay evaluation.
Recent advances in Large Language Models offer opportunities to automate and streamline these processes.
This paper addresses these challenges by introducing a comprehensive automated card game prototyping framework.
arXiv Detail & Related papers (2025-02-10T23:47:35Z) - PATCH: Empowering Large Language Model with Programmer-Intent Guidance and Collaborative-Behavior Simulation for Automatic Bug Fixing [34.768989900184636]
Bug fixing holds significant importance in software development and maintenance.
Recent research has made substantial strides in exploring the potential of large language models (LLMs) for automatically resolving software bugs.
arXiv Detail & Related papers (2025-01-27T15:43:04Z) - Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models.
We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics.
Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z) - Revealing the Challenge of Detecting Character Knowledge Errors in LLM Role-Playing [14.950721395944388]
We propose a probing dataset to evaluate LLMs' ability to detect two types of character knowledge errors: known knowledge errors (KKE) and unknown knowledge errors (UKE).
The results indicate that even the latest LLMs struggle to effectively detect these two types of errors.
We propose an agent-based reasoning method, Self-Recollection and Self-Doubt, to further explore the potential for improving error detection capabilities.
arXiv Detail & Related papers (2024-09-18T06:21:44Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - DALD: Improving Logits-based Detector without Logits from Black-box LLMs [56.234109491884126]
Large Language Models (LLMs) have revolutionized text generation, producing outputs that closely mimic human writing.
We present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection.
DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations.
arXiv Detail & Related papers (2024-06-07T19:38:05Z) - Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift, in which data processed by the LLM covertly redirects it away from the user's original instructions, allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
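The summary above reports that a simple linear probe over LLM activations separates drifted from clean runs; the snippet below shows what such a probe might look like with scikit-learn, using random vectors in place of real activations. The synthetic data and feature dimensions are assumptions for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for hidden-state activations captured before/after the model reads
# external content; real features would come from the LLM's residual stream.
clean = rng.normal(0.0, 1.0, size=(500, 1024))
drifted = rng.normal(0.3, 1.0, size=(500, 1024))  # small synthetic shift

X = np.vstack([clean, drifted])
y = np.concatenate([np.zeros(500), np.ones(500)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = probe.predict_proba(X_te)[:, 1]
print("ROC AUC on held-out split:", roc_auc_score(y_te, scores))
```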
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks, yet they can still produce flawed output such as hallucinations or unfaithful reasoning.
A promising approach to rectify these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output.
This paper presents a comprehensive review of this emerging class of techniques.
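As a generic illustration of the self-correction pattern this survey covers (not any specific method from it), the sketch below runs a draft-critique-revise loop around a placeholder `call_llm` function; the prompts and stopping criterion are assumptions.

```python
from typing import Callable

LLMFn = Callable[[str], str]  # placeholder for any chat-completion backend

def self_correct(task: str, call_llm: LLMFn, max_rounds: int = 3) -> str:
    """Draft an answer, ask the model to critique it, and revise until the
    critique reports no problems or the round budget is exhausted."""
    answer = call_llm(f"Task: {task}\nWrite your best answer.")
    for _ in range(max_rounds):
        critique = call_llm(
            f"Task: {task}\nAnswer:\n{answer}\n\n"
            "Point out factual or logical problems, or reply 'NO ISSUES'."
        )
        if "NO ISSUES" in critique.upper():
            break  # the model found nothing to fix
        answer = call_llm(
            f"Task: {task}\nPrevious answer:\n{answer}\n"
            f"Critique:\n{critique}\n\nRewrite the answer fixing these problems."
        )
    return answer
```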
arXiv Detail & Related papers (2023-08-06T18:38:52Z) - Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes LLM-Augmenter, a system that augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)