CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning
- URL: http://arxiv.org/abs/2203.11096v2
- Date: Tue, 22 Mar 2022 23:37:49 GMT
- Title: CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning
- Authors: Mohammad Reza Taesiri, Finlay Macklon, Cor-Paul Bezemer
- Abstract summary: We propose a search method that accepts any English text query as input to retrieve relevant gameplay videos.
Our approach does not rely on any external information (such as video metadata).
An example application of our approach is as a gameplay video search engine to aid in reproducing video game bugs.
- Score: 4.168157981135698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gameplay videos contain rich information about how players interact with the
game and how the game responds. Sharing gameplay videos on social media
platforms, such as Reddit, has become a common practice for many players.
Often, players will share gameplay videos that showcase video game bugs. Such
gameplay videos are software artifacts that can be utilized for game testing,
as they provide insight for bug analysis. Although large repositories of
gameplay videos exist, parsing and mining them in an effective and structured
fashion remains a major challenge. In this paper, we propose a search
method that accepts any English text query as input to retrieve relevant videos
from large repositories of gameplay videos. Our approach does not rely on any
external information (such as video metadata); it works solely based on the
content of the video. By leveraging the zero-shot transfer capabilities of the
Contrastive Language-Image Pre-Training (CLIP) model, our approach does not
require any data labeling or training. To evaluate our approach, we present the
$\texttt{GamePhysics}$ dataset consisting of 26,954 videos from 1,873 games,
that were collected from the GamePhysics section on the Reddit website. Our
approach shows promising results in our extensive analysis of simple queries,
compound queries, and bug queries, indicating that our approach is useful for
object and event detection in gameplay videos. An example application of our
approach is as a gameplay video search engine to aid in reproducing video game
bugs. Please visit the following link for the code and the data:
https://asgaardlab.github.io/CLIPxGamePhysics/
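As a rough illustration of this retrieval setup, the following is a minimal sketch using the open-source CLIP package, assuming each video has already been decoded into a folder of sampled frames. The frame sampling, the mean-pooling of frame embeddings, and the file layout are simplifying assumptions for illustration, not necessarily the authors' exact pipeline.

import glob
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_video(frame_dir):
    # Encode the sampled frames of one video and pool them into a single vector.
    frame_paths = sorted(glob.glob(f"{frame_dir}/*.jpg"))
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(images)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)  # mean pooling is one possible aggregation choice

def rank_videos(query, video_dirs):
    # Rank videos by cosine similarity between the text query and pooled frame embeddings.
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scored = [(d, float(text_feat @ encode_video(d))) for d in video_dirs]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Example bug query against a small set of hypothetical clip folders:
# rank_videos("a horse flying in the air", ["clips/video_001", "clips/video_002"])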
Related papers
- VideoGameBunny: Towards vision assistants for video games [4.652236080354487]
This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games.
We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles.
Our experiments show that our high quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVa-1.6-34b.
arXiv Detail & Related papers (2024-07-21T23:31:57Z) - Finding the Needle in a Haystack: Detecting Bug Occurrences in Gameplay
Videos [10.127506928281413]
We present an automated approach that uses machine learning to predict whether a segment of a gameplay video contains a depiction of a bug.
We analyzed 4,412 segments of 198 gameplay videos to predict whether a segment contains an instance of a bug.
Our approach is effective at detecting segments of a video that contain bugs, achieving a high F1 score of 0.88, outperforming the current state-of-the-art technique for bug classification.
arXiv Detail & Related papers (2023-11-18T01:14:18Z) - Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: we randomly drop input video patches and mask out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performance, comparable to heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z) - Using Gameplay Videos for Detecting Issues in Video Games [14.41863992598613]
Streamers may encounter several problems (such as bugs, glitches, or performance issues) while they play.
The identified problems may negatively impact the user's gaming experience and, in turn, can harm the reputation of the game and of the producer.
We propose and empirically evaluate GELID, an approach for automatically extracting relevant information from gameplay videos.
arXiv Detail & Related papers (2023-07-27T10:16:04Z) - TG-VQA: Ternary Game of Video Question Answering [33.180788803602084]
Video question answering aims to answer a question about video content by reasoning over the alignment between visual and linguistic semantics.
In this work, we innovatively resort to game theory, which can simulate complicated relationships among multiple players with specific interaction strategies.
Specifically, we carefully design an interaction strategy tailored to the characteristics of VideoQA, which can mathematically generate fine-grained visual-linguistic alignment labels without label-intensive annotation effort.
arXiv Detail & Related papers (2023-05-17T08:42:53Z) - GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for
Real-time Soccer Commentary Generation [75.60413443783953]
We present GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples, proposing a challenging new task setting: Knowledge-grounded Video Captioning (KGVC).
Our data and code are available at https://github.com/THU-KEG/goal.
arXiv Detail & Related papers (2023-03-26T08:43:36Z) - Subjective and Objective Analysis of Streamed Gaming Videos [60.32100758447269]
We study subjective and objective Video Quality Assessment (VQA) models on gaming videos.
We created a novel gaming video resource, the LIVE-YouTube Gaming video quality (LIVE-YT-Gaming) database, comprising 600 real gaming videos.
We conducted a subjective human study on this data, yielding 18,600 human quality ratings recorded by 61 human subjects.
arXiv Detail & Related papers (2022-03-24T03:02:57Z) - Few-Shot Learning for Video Object Detection in a Transfer-Learning
Scheme [70.45901040613015]
We study the new problem of few-shot learning for video object detection.
We employ a transfer-learning framework to effectively train the video object detector on a large number of base-class objects and a few video clips of novel-class objects.
arXiv Detail & Related papers (2021-03-26T20:37:55Z) - What is More Likely to Happen Next? Video-and-Language Future Event
Prediction [111.93601253692165]
Given a video with aligned dialogue, people can often infer what is more likely to happen next.
In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions.
We collect a new dataset, named Video-and-Language Event Prediction, with 28,726 future event prediction examples.
arXiv Detail & Related papers (2020-10-15T19:56:47Z) - Enhancing Unsupervised Video Representation Learning by Decoupling the
Scene and the Motion [86.56202610716504]
Action categories are highly correlated with the scenes in which the actions happen, causing models to degrade to a solution where only scene information is encoded.
We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays greater attention to motion information.
arXiv Detail & Related papers (2020-09-12T09:54:11Z)