Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs
- URL: http://arxiv.org/abs/2406.04482v1
- Date: Thu, 6 Jun 2024 20:11:08 GMT
- Title: Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs
- Authors: Claire Jin, Sudha Rao, Xiangyu Peng, Portia Botchway, Jessica Quaye, Chris Brockett, Bill Dolan,
- Abstract summary: Large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players andNPCs.
LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs.
We propose a systematic LLM-based method for automatically identifying such bugs from player game logs.
- Score: 17.84810486479385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and non-player characters (NPCs). However, LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs. Automated techniques for detecting such game bugs are still lacking. To address this, we propose a systematic LLM-based method for automatically identifying such bugs from player game logs, eliminating the need for collecting additional data such as post-play surveys. Applied to a text-based game DejaBoom!, our approach effectively identifies bugs inherent in LLM-powered interactive games, surpassing unstructured LLM-powered bug-catching methods and filling the gap in automated detection of logical and design flaws.
Related papers
- Grammar-based Game Description Generation using Large Language Models [12.329521804287259]
We introduce the grammar of game descriptions, which effectively structures the game design space, into the reasoning process.
Our experiments demonstrate that this approach performs well in generating game descriptions.
arXiv Detail & Related papers (2024-07-24T16:36:02Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - Improving Logits-based Detector without Logits from Black-box LLMs [56.234109491884126]
Large Language Models (LLMs) have revolutionized text generation, producing outputs that closely mimic human writing.
We present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection.
DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations.
arXiv Detail & Related papers (2024-06-07T19:38:05Z) - Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends [38.86240794422485]
We evaluate the faithfulness of large language models for dialogue summarization.
Our evaluation reveals subtleties as to what constitutes a hallucination.
We introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics.
arXiv Detail & Related papers (2024-06-05T17:49:47Z) - Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvements of Large Language Models.
It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop.
Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
arXiv Detail & Related papers (2024-04-18T15:21:34Z) - Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models [38.79074982172423]
We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text.
We propose modeling factual queries as constraint satisfaction problems.
We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations.
arXiv Detail & Related papers (2023-09-26T17:48:55Z) - Automatically Correcting Large Language Models: Surveying the landscape
of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks.
A promising approach to rectify these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output.
This paper presents a comprehensive review of this emerging class of techniques.
arXiv Detail & Related papers (2023-08-06T18:38:52Z) - Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as
Conversational Agents [20.202525145391093]
Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents"
This paper explores: Can Large Language Models be evaluated meaningfully by exposing them to constrained game-like settings?
As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable to follow game-play instructions.
arXiv Detail & Related papers (2023-05-22T19:56:10Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.