Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors
- URL: http://arxiv.org/abs/2210.02506v1
- Date: Wed, 5 Oct 2022 18:44:35 GMT
- Title: Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors
- Authors: Mohammad Reza Taesiri, Finlay Macklon, Yihe Wang, Hengshuo Shen,
Cor-Paul Bezemer
- Abstract summary: We show that large language models can identify which event is buggy in a sequence of textual descriptions of events from a game.
Our results show promise for employing language models to detect video game bugs.
- Score: 3.39487428163997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video game testing requires game-specific knowledge as well as common sense
reasoning about the events in the game. While AI-driven agents can satisfy the
first requirement, it is not yet possible to meet the second requirement
automatically. Therefore, video game testing often still relies on manual
testing, and human testers are required to play the game thoroughly to detect
bugs. As a result, it is challenging to fully automate game testing. In this
study, we explore the possibility of leveraging the zero-shot capabilities of
large language models for video game bug detection. By formulating the bug
detection problem as a question-answering task, we show that large language
models can identify which event is buggy in a sequence of textual descriptions
of events from a game. To this end, we introduce the GameBugDescriptions
benchmark dataset, which consists of 167 buggy gameplay videos and a total of
334 question-answer pairs across 8 games. We extensively evaluate the
performance of six models across the OPT and InstructGPT large language model
families on our benchmark dataset. Our results show promise for employing
large language models to detect video game bugs. With the proper prompting
technique, we could achieve an accuracy of 70.66%, and on some video games, up
to 78.94%. Our code, evaluation data, and the benchmark can be found at
https://asgaardlab.github.io/LLMxBugs
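A minimal sketch of the question-answering formulation described above, assuming a small HuggingFace OPT checkpoint as the zero-shot model; the event descriptions, prompt wording, and model size are illustrative assumptions rather than the authors' released benchmark or prompts.

# Zero-shot bug detection phrased as question answering (illustrative sketch).
# facebook/opt-1.3b stands in for the larger OPT and InstructGPT models
# evaluated in the paper; the events below are invented, not from the dataset.
from transformers import pipeline

generator = pipeline("text-generation", model="facebook/opt-1.3b")

events = [
    "1. The player character walks toward a wooden door.",
    "2. The player presses the interact button and the door stays closed.",
    "3. The player character walks straight through the closed door.",
    "4. The camera returns to the default third-person view.",
]

prompt = (
    "The following events were observed in a gameplay video:\n"
    + "\n".join(events)
    + "\nQuestion: Which event describes a bug?\nAnswer: Event"
)

# Greedy decoding; only the continuation after the prompt is kept as the answer.
completion = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
print(completion[len(prompt):].strip())

Ending the prompt with "Answer: Event" nudges the model to name one of the listed events, which keeps the output easy to score against an annotated buggy event.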
Related papers
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Deriving and Evaluating a Detailed Taxonomy of Game Bugs [2.2136561577994858]
The goal of this work is to provide a bug taxonomy for games that will help game developers develop bug-resistant games.
We performed a Multivocal Literature Review (MLR) by analyzing 436 sources, out of which 189 (78 academic and 111 grey) sources reporting bugs encountered in the game development industry were selected for analysis.
The MLR allowed us to finalize a detailed taxonomy of 63 game bug categories from the end-user perspective.
arXiv Detail & Related papers (2023-11-28T09:51:42Z)
- Finding the Needle in a Haystack: Detecting Bug Occurrences in Gameplay Videos [10.127506928281413]
We present an automated approach that uses machine learning to predict whether a segment of a gameplay video contains a depiction of a bug.
We analyzed 4,412 segments of 198 gameplay videos to predict whether a segment contains an instance of a bug.
Our approach is effective at detecting segments of a video that contain bugs, achieving a high F1 score of 0.88, outperforming the current state-of-the-art technique for bug classification.
arXiv Detail & Related papers (2023-11-18T01:14:18Z)
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
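A rough sketch of such a generate-execute-refine loop is shown below; the query_llm helper is a hypothetical stand-in for the model call, and the captured traceback used as feedback is a simplification of the paper's few-shot prompts and feedback signals.

import subprocess
import sys
import tempfile

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for a call to a code-generation model.
    raise NotImplementedError("plug in a real model call here")

def run_candidate(code: str) -> str:
    # Execute the candidate program and return its stderr (empty if it ran cleanly).
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=10)
    return result.stderr

def self_debug(task: str, max_rounds: int = 3) -> str:
    code = query_llm(f"Write a Python program for this task:\n{task}")
    for _ in range(max_rounds):
        error = run_candidate(code)
        if not error:
            return code  # the program ran without crashing
        # Feed the execution feedback back to the model and ask for a fix.
        code = query_llm(
            f"The following program failed:\n{code}\n"
            f"Error message:\n{error}\n"
            "Explain the mistake and return a corrected program."
        )
    return code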
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
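As a toy illustration of probing internal activations without supervision, the sketch below trains a small consistency-based probe on synthetic contrast-pair activations; the loss terms follow the contrast-consistent search idea as commonly described, and the synthetic data, probe shape, and hyperparameters are assumptions made purely for illustration.

import torch

torch.manual_seed(0)

# Synthetic stand-ins for hidden states of paired statements ("X? Yes." / "X? No.").
# In the real setting these would be a language model's internal activations.
N, d = 256, 64
truth = torch.randint(0, 2, (N,)).float()  # ground truth, never used during training
direction = torch.randn(d)
h_yes = torch.randn(N, d) + torch.outer(truth, direction)
h_no = torch.randn(N, d) + torch.outer(1.0 - truth, direction)

probe = torch.nn.Sequential(torch.nn.Linear(d, 1), torch.nn.Sigmoid())
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)

for _ in range(500):
    p_yes = probe(h_yes).squeeze(-1)
    p_no = probe(h_no).squeeze(-1)
    consistency = ((p_yes - (1.0 - p_no)) ** 2).mean()     # complementary answers should sum to 1
    confidence = (torch.minimum(p_yes, p_no) ** 2).mean()  # discourage the trivial 0.5/0.5 solution
    loss = consistency + confidence
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The probe is only determined up to a sign flip, so report the better orientation.
with torch.no_grad():
    pred = (0.5 * (probe(h_yes).squeeze(-1) + 1.0 - probe(h_no).squeeze(-1)) > 0.5).float()
acc = (pred == truth).float().mean().item()
print(f"unsupervised probe accuracy: {max(acc, 1.0 - acc):.2f}")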
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Learning to Identify Perceptual Bugs in 3D Video Games [1.370633147306388]
We show that it is possible to identify a range of perceptual bugs using learning-based methods.
World of Bugs (WOB) is an open platform for testing automated bug detection (ABD) methods in 3D game environments.
arXiv Detail & Related papers (2022-02-25T18:50:11Z)
- CommonsenseQA 2.0: Exposing the Limits of AI through Gamification [126.85096257968414]
We construct benchmarks that test the abilities of modern natural language understanding models.
In this work, we propose gamification as a framework for data construction.
arXiv Detail & Related papers (2022-01-14T06:49:15Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.