Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors
- URL: http://arxiv.org/abs/2210.02506v1
- Date: Wed, 5 Oct 2022 18:44:35 GMT
- Title: Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors
- Authors: Mohammad Reza Taesiri, Finlay Macklon, Yihe Wang, Hengshuo Shen,
Cor-Paul Bezemer
- Abstract summary: We show that large language models can identify which event is buggy in a sequence of textual descriptions of events from a game.
Our results show promise for employing language models to detect video game bugs.
- Score: 3.39487428163997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video game testing requires game-specific knowledge as well as common sense
reasoning about the events in the game. While AI-driven agents can satisfy the
first requirement, it is not yet possible to meet the second requirement
automatically. Therefore, video game testing often still relies on manual
testing, and human testers are required to play the game thoroughly to detect
bugs. As a result, it is challenging to fully automate game testing. In this
study, we explore the possibility of leveraging the zero-shot capabilities of
large language models for video game bug detection. By formulating the bug
detection problem as a question-answering task, we show that large language
models can identify which event is buggy in a sequence of textual descriptions
of events from a game. To this end, we introduce the GameBugDescriptions
benchmark dataset, which consists of 167 buggy gameplay videos and a total of
334 question-answer pairs across 8 games. We extensively evaluate the
performance of six models across the OPT and InstructGPT large language model
families on our benchmark dataset. Our results show promise for employing
large language models to detect video game bugs. With the proper prompting
technique, we could achieve an accuracy of 70.66%, and on some video games, up
to 78.94%. Our code, evaluation data, and the benchmark can be found at
https://asgaardlab.github.io/LLMxBugs
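A minimal sketch of the question-answering formulation described above, assuming a small HuggingFace OPT checkpoint as the zero-shot model; the event descriptions, prompt wording, and model size are illustrative assumptions rather than the authors' released benchmark or prompts.

# Zero-shot bug detection phrased as question answering (illustrative sketch).
# facebook/opt-1.3b stands in for the larger OPT and InstructGPT models
# evaluated in the paper; the events below are invented, not from the dataset.
from transformers import pipeline

generator = pipeline("text-generation", model="facebook/opt-1.3b")

events = [
    "1. The player character walks toward a wooden door.",
    "2. The player presses the interact button and the door stays closed.",
    "3. The player character walks straight through the closed door.",
    "4. The camera returns to the default third-person view.",
]

prompt = (
    "The following events were observed in a gameplay video:\n"
    + "\n".join(events)
    + "\nQuestion: Which event describes a bug?\nAnswer: Event"
)

# Greedy decoding; only the continuation after the prompt is kept as the answer.
completion = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
print(completion[len(prompt):].strip())

Ending the prompt with "Answer: Event" nudges the model to name one of the listed events, which keeps the output easy to score against an annotated buggy event.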
Related papers
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Deriving and Evaluating a Detailed Taxonomy of Game Bugs [2.2136561577994858]
The goal of this work is to provide a bug taxonomy for games that will help game developers develop bug-resistant games.
We performed a Multivocal Literature Review (MLR) by analyzing 436 sources, out of which 189 (78 academic and 111 grey) sources reporting bugs encountered in the game development industry were selected for analysis.
The MLR allowed us to finalize a detailed taxonomy of 63 game bug categories from the end-user perspective.
arXiv Detail & Related papers (2023-11-28T09:51:42Z)
- Finding the Needle in a Haystack: Detecting Bug Occurrences in Gameplay Videos [10.127506928281413]
We present an automated approach that uses machine learning to predict whether a segment of a gameplay video contains a depiction of a bug.
We analyzed 4,412 segments of 198 gameplay videos to predict whether a segment contains an instance of a bug.
Our approach is effective at detecting segments of a video that contain bugs, achieving a high F1 score of 0.88, outperforming the current state-of-the-art technique for bug classification.
arXiv Detail & Related papers (2023-11-18T01:14:18Z)
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
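A rough sketch of such a generate-execute-refine loop is shown below; the query_llm helper is a hypothetical stand-in for the model call, and the captured traceback used as feedback is a simplification of the paper's few-shot prompts and feedback signals.

import subprocess
import sys
import tempfile

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for a call to a code-generation model.
    raise NotImplementedError("plug in a real model call here")

def run_candidate(code: str) -> str:
    # Execute the candidate program and return its stderr (empty if it ran cleanly).
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=10)
    return result.stderr

def self_debug(task: str, max_rounds: int = 3) -> str:
    code = query_llm(f"Write a Python program for this task:\n{task}")
    for _ in range(max_rounds):
        error = run_candidate(code)
        if not error:
            return code  # the program ran without crashing
        # Feed the execution feedback back to the model and ask for a fix.
        code = query_llm(
            f"The following program failed:\n{code}\n"
            f"Error message:\n{error}\n"
            "Explain the mistake and return a corrected program."
        )
    return code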
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
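As a toy illustration of probing internal activations without supervision, the sketch below trains a small consistency-based probe on synthetic contrast-pair activations; the loss terms follow the contrast-consistent search idea as commonly described, and the synthetic data, probe shape, and hyperparameters are assumptions made purely for illustration.

import torch

torch.manual_seed(0)

# Synthetic stand-ins for hidden states of paired statements ("X? Yes." / "X? No.").
# In the real setting these would be a language model's internal activations.
N, d = 256, 64
truth = torch.randint(0, 2, (N,)).float()  # ground truth, never used during training
direction = torch.randn(d)
h_yes = torch.randn(N, d) + torch.outer(truth, direction)
h_no = torch.randn(N, d) + torch.outer(1.0 - truth, direction)

probe = torch.nn.Sequential(torch.nn.Linear(d, 1), torch.nn.Sigmoid())
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)

for _ in range(500):
    p_yes = probe(h_yes).squeeze(-1)
    p_no = probe(h_no).squeeze(-1)
    consistency = ((p_yes - (1.0 - p_no)) ** 2).mean()     # complementary answers should sum to 1
    confidence = (torch.minimum(p_yes, p_no) ** 2).mean()  # discourage the trivial 0.5/0.5 solution
    loss = consistency + confidence
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The probe is only determined up to a sign flip, so report the better orientation.
with torch.no_grad():
    pred = (0.5 * (probe(h_yes).squeeze(-1) + 1.0 - probe(h_no).squeeze(-1)) > 0.5).float()
acc = (pred == truth).float().mean().item()
print(f"unsupervised probe accuracy: {max(acc, 1.0 - acc):.2f}")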
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Learning to Identify Perceptual Bugs in 3D Video Games [1.370633147306388]
We show that it is possible to identify a range of perceptual bugs using learning-based methods.
World of Bugs (WOB) is an open platform for testing automated bug detection (ABD) methods in 3D game environments.
arXiv Detail & Related papers (2022-02-25T18:50:11Z)
- CommonsenseQA 2.0: Exposing the Limits of AI through Gamification [126.85096257968414]
We construct benchmarks that test the abilities of modern natural language understanding models.
In this work, we propose gamification as a framework for data construction.
arXiv Detail & Related papers (2022-01-14T06:49:15Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.