Automated Bug Frame Retrieval from Gameplay Videos Using Vision-Language Models
- URL: http://arxiv.org/abs/2508.04895v1
- Date: Wed, 06 Aug 2025 21:52:15 GMT
- Title: Automated Bug Frame Retrieval from Gameplay Videos Using Vision-Language Models
- Authors: Wentao Lu, Alexander Senchenko, Abram Hindle, Cor-Paul Bezemer
- Abstract summary: We introduce a pipeline that reduces each video to a single frame that best matches the reported bug description. Our approach dramatically reduces manual effort and speeds up triage and regression checks. It offers practical benefits to quality assurance teams and developers across the game industry.
- Score: 47.63488459021783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern game studios deliver new builds and patches at a rapid pace, generating thousands of bug reports, many of which embed gameplay videos. To verify and triage these bug reports, developers must watch the submitted videos. This manual review is labour-intensive, slow, and hard to scale. In this paper, we introduce an automated pipeline that reduces each video to a single frame that best matches the reported bug description, giving developers instant visual evidence that pinpoints the bug. Our pipeline begins with FFmpeg for keyframe extraction, reducing each video to a median of just 1.90% of its original frames while still capturing bug moments in 98.79% of cases. These keyframes are then evaluated by a vision-language model (GPT-4o), which ranks them based on how well they match the textual bug description and selects the most representative frame. We evaluated this approach using real-world developer-submitted gameplay videos and JIRA bug reports from a popular First-Person Shooter (FPS) game. The pipeline achieves an overall F1 score of 0.79 and an accuracy of 0.89 for the top-1 retrieved frame. Performance is highest for the Lighting & Shadow (F1 = 0.94), Physics & Collision (0.86), and UI & HUD (0.83) bug categories, and lowest for Animation & VFX (0.51). By replacing video viewing with an immediately informative image, our approach dramatically reduces manual effort and speeds up triage and regression checks, offering practical benefits to quality assurance (QA) teams and developers across the game industry.
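The paper itself does not include code, but the two-stage design described in the abstract (FFmpeg keyframe extraction followed by vision-language ranking) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the FFmpeg flags, the prompt wording, the helper names extract_keyframes and rank_frames, and the example file names are all illustrative.

```python
# Sketch of the two-stage pipeline from the abstract:
# 1) extract keyframes with FFmpeg, 2) ask a vision-language model (GPT-4o)
# which keyframe best matches the textual bug description.
# Flags, prompt, and names are illustrative assumptions, not the paper's exact setup.
import base64
import subprocess
from pathlib import Path

from openai import OpenAI  # pip install openai


def extract_keyframes(video: Path, out_dir: Path) -> list[Path]:
    """Dump only I-frames (keyframes), keeping a small fraction of all frames."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", str(video),
            "-vf", "select='eq(pict_type,PICT_TYPE_I)'",  # keep intra-coded frames only
            "-vsync", "vfr",                              # do not duplicate dropped frames
            str(out_dir / "keyframe_%04d.jpg"),
        ],
        check=True,
    )
    return sorted(out_dir.glob("keyframe_*.jpg"))


def rank_frames(frames: list[Path], bug_description: str) -> str:
    """Ask the model which numbered keyframe best shows the reported bug."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    content = [{
        "type": "text",
        "text": (
            "Bug report: " + bug_description
            + "\nThe images below are numbered in order. "
              "Reply with the number of the single image that best shows this bug."
        ),
    }]
    for frame in frames:
        b64 = base64.b64encode(frame.read_bytes()).decode()
        content.append(
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    frames = extract_keyframes(Path("bug_report.mp4"), Path("keyframes"))
    print(rank_frames(frames, "Player weapon clips through the wall after reloading"))
```

Keeping only I-frames is one common way to approximate keyframe extraction with FFmpeg; the paper reports that its keyframe filtering retains a median of 1.90% of frames while still covering the bug moment in 98.79% of cases.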
Related papers
- From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos [48.666667545084835]
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change.
We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR.
TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving.
arXiv Detail & Related papers (2025-06-05T17:31:17Z)
- Semantic GUI Scene Learning and Video Alignment for Detecting Duplicate Video-based Bug Reports [16.45808969240553]
Video-based bug reports are increasingly being used to document bugs for programs centered around a graphical user interface (GUI).
We introduce a new approach, called JANUS, that adapts the scene-learning capabilities of vision transformers to capture subtle visual and textual patterns that manifest on app UI screens.
JANUS also makes use of a video alignment technique that adaptively weights video frames to account for typical bug manifestation patterns.
arXiv Detail & Related papers (2024-07-11T15:48:36Z)
- Finding the Needle in a Haystack: Detecting Bug Occurrences in Gameplay Videos [10.127506928281413]
We present an automated approach that uses machine learning to predict whether a segment of a gameplay video contains a depiction of a bug.
We analyzed 4,412 segments of 198 gameplay videos to predict whether a segment contains an instance of a bug.
Our approach is effective at detecting segments of a video that contain bugs, achieving a high F1 score of 0.88, outperforming the current state-of-the-art technique for bug classification.
arXiv Detail & Related papers (2023-11-18T01:14:18Z)
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
- AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection [70.99025467739715]
We release a new public Short video sHot bOundary deTection dataset, named SHOT.
SHOT consists of 853 complete short videos and 11,606 shot annotations, with 2,716 high quality shot boundary annotations in 200 test videos.
Our proposed approach, named AutoShot, achieves higher F1 scores than previous state-of-the-art approaches.
arXiv Detail & Related papers (2023-04-12T19:01:21Z)
- Making Video Quality Assessment Models Sensitive to Frame Rate Distortions [63.749184706461826]
We consider the problem of capturing distortions arising from changes in frame rate as part of Video Quality Assessment (VQA).
We propose a simple fusion framework, whereby temporal features from GREED are combined with existing VQA models.
Our results suggest that employing efficient temporal representations can result in much more robust and accurate VQA models.
arXiv Detail & Related papers (2022-05-21T04:13:57Z)
- CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning [4.168157981135698]
We propose a search method that accepts any English text query as input to retrieve relevant gameplay videos.
Our approach does not rely on any external information (such as video metadata).
An example application of our approach is as a gameplay video search engine to aid in reproducing video game bugs.
arXiv Detail & Related papers (2022-03-21T16:23:02Z)
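As a concrete illustration of the zero-shot retrieval idea in the CLIP meets GamePhysics entry above, the sketch below scores sampled gameplay frames against an English text query with an off-the-shelf CLIP checkpoint. The checkpoint name, the max-over-frames aggregation, and the helper score_video_frames are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of zero-shot text-to-gameplay-frame scoring with CLIP.
# Checkpoint, frame aggregation, and function names are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor  # pip install transformers pillow torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def score_video_frames(frame_paths: list[str], query: str) -> float:
    """Return a video-level relevance score: the best frame/query similarity."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_frames, 1): scaled image-text similarities.
    return outputs.logits_per_image.squeeze(-1).max().item()


# Example usage: rank candidate videos (each given as sampled frames) for one query.
# videos = {"clip_001": ["f1.jpg", "f2.jpg"], "clip_002": ["g1.jpg"]}
# ranked = sorted(videos, key=lambda v: score_video_frames(
#     videos[v], "a car flying into the air after touching a wall"), reverse=True)
```

Max-pooling per-frame similarities is one simple way to turn frame-level scores into a video-level ranking signal for text queries.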
- Unsupervised Visual Representation Learning by Tracking Patches in Video [88.56860674483752]
We propose to use tracking as a proxy task for a computer vision system to learn visual representations.
Modelled on the Catch game played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations.
arXiv Detail & Related papers (2021-05-06T09:46:42Z)
- Unsupervised Temporal Feature Aggregation for Event Detection in Unstructured Sports Videos [10.230408415438966]
We study the case of event detection in sports videos for unstructured environments with arbitrary camera angles.
We identify and solve two major problems: unsupervised identification of players in an unstructured setting and generalization of the trained models to pose variations due to arbitrary shooting angles.
arXiv Detail & Related papers (2020-02-19T10:24:22Z)