VideoGameBunny: Towards vision assistants for video games
- URL: http://arxiv.org/abs/2407.15295v1
- Date: Sun, 21 Jul 2024 23:31:57 GMT
- Title: VideoGameBunny: Towards vision assistants for video games
- Authors: Mohammad Reza Taesiri, Cor-Paul Bezemer
- Abstract summary: This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games.
We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles.
Our experiments show that our high-quality game-related data can make a relatively small model outperform the much larger state-of-the-art model LLaVa-1.6-34b.
- Score: 4.652236080354487
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities have limitations in the video game domain, such as challenges with scene understanding, hallucinations, and inaccurate descriptions of video game content, especially in open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements of 136,974 images. Our experiments show that our high quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVa-1.6-34b (which has more than 4x the number of parameters). Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging. Code and data are available at https://videogamebunny.github.io/
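To make the dataset description above concrete, here is a minimal sketch of what a single image-instruction record could look like in a LLaVA-style annotation format. The field names, file path, and values are illustrative assumptions, not the released dataset's actual schema.

```python
import json

# Hypothetical image-instruction record combining the annotation types mentioned in
# the abstract: a caption, question-answer pairs, and a JSON-style structured
# description of the frame. All keys and values below are placeholders.
record = {
    "image": "images/example_game/frame_000123.png",  # placeholder path
    "game_title": "example_game",                      # one of the 413 titles (illustrative)
    "caption": "A knight stands on a cliff overlooking a ruined castle at dusk.",
    "qa_pairs": [
        {
            "question": "What time of day does the scene depict?",
            "answer": "Dusk, indicated by the low orange light on the horizon.",
        }
    ],
    "structured_elements": {                            # JSON-style scene description
        "characters": ["player knight"],
        "location": "cliff edge near a ruined castle",
        "on_screen_ui": ["health bar", "stamina bar"],
    },
}

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
```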
Related papers
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- L4GM: Large 4D Gaussian Reconstruction Model [99.82220378522624]
We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input.
Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects.
arXiv Detail & Related papers (2024-06-14T17:51:18Z)
- Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks [6.925770576386087]
We propose a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting.
Our experiments show that image-text models exhibit impressive performance on video action recognition (AR), video retrieval (RT), and video multiple-choice (MC) tasks.
These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
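As a concrete illustration of the zero-shot recipe summarized above, the sketch below scores a short clip against text labels with an off-the-shelf image-text model by mean-pooling per-frame embeddings. This is a common baseline, assuming a CLIP checkpoint from Hugging Face; it is not necessarily the paper's exact evaluation protocol.

```python
# Zero-shot video classification with an image-text model: encode sampled frames,
# mean-pool them into a clip-level embedding, and rank text labels by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder frames: in practice these would be sampled from a video clip.
frames = [Image.new("RGB", (224, 224)) for _ in range(4)]
labels = ["a person playing soccer", "a person cooking", "a person driving a car"]

with torch.no_grad():
    image_inputs = processor(images=frames, return_tensors="pt")
    text_inputs = processor(text=labels, return_tensors="pt", padding=True)
    frame_emb = model.get_image_features(**image_inputs)   # (num_frames, d)
    text_emb = model.get_text_features(**text_inputs)      # (num_labels, d)

    # Mean-pool frame embeddings, normalize, and compare to label embeddings.
    video_emb = frame_emb.mean(dim=0, keepdim=True)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = (video_emb @ text_emb.T).squeeze(0)

print(labels[int(scores.argmax())])
```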
arXiv Detail & Related papers (2023-10-07T20:57:54Z)
- Using Gameplay Videos for Detecting Issues in Video Games [14.41863992598613]
Streamers may encounter several problems (such as bugs, glitches, or performance issues) while they play.
The identified problems may negatively impact the user's gaming experience and, in turn, can harm the reputation of the game and its producer.
We propose and empirically evaluate GELID, an approach for automatically extracting relevant information from gameplay videos.
arXiv Detail & Related papers (2023-07-27T10:16:04Z)
- GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation [75.60413443783953]
We present GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples, which supports a challenging new task setting: Knowledge-grounded Video Captioning (KGVC).
Our data and code are available at https://github.com/THU-KEG/goal.
arXiv Detail & Related papers (2023-03-26T08:43:36Z)
- Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models [68.85478477006178]
We present a Promptable Game Model (PGM) for neural video game simulators.
It allows a user to play the game by prompting it with high- and low-level action sequences.
Most notably, our PGM unlocks a director's mode, where the game is played by specifying goals for the agents in the form of a prompt.
Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art.
arXiv Detail & Related papers (2023-03-23T17:43:17Z)
- Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors [3.39487428163997]
We show that large language models can identify which event is buggy in a sequence of textual descriptions of events from a game.
These results suggest that language models are a promising tool for detecting video game bugs.
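A minimal sketch of the prompting setup this entry describes: a language model is given a numbered sequence of textual event descriptions and asked which one is buggy. The events and the query_llm stub are illustrative placeholders, not the paper's prompts or harness.

```python
# Build a bug-detection prompt over a sequence of textual game events.
events = [
    "The player walks toward the locked door.",
    "The player uses the rusty key on the door.",
    "The door opens.",
    "The player walks through the wall next to the door.",  # collision bug (illustrative)
]

prompt = (
    "The following is a sequence of events observed in a video game.\n"
    + "\n".join(f"{i + 1}. {event}" for i, event in enumerate(events))
    + "\nWhich event, if any, looks like a bug? Answer with the event number and a short reason."
)

def query_llm(text: str) -> str:
    """Stand-in for a call to any instruction-tuned language model."""
    raise NotImplementedError("Plug in your preferred LLM client here.")

if __name__ == "__main__":
    print(prompt)  # replace with print(query_llm(prompt)) once a model client is wired up
```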
arXiv Detail & Related papers (2022-10-05T18:44:35Z)
- WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models [91.92346150646007]
In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations.
We use the game to collect 3.5K instances, finding that they are intuitive for humans but challenging for state-of-the-art AI models.
Our analysis, as well as the feedback we collect from players, indicates that the collected associations require diverse reasoning skills.
arXiv Detail & Related papers (2022-07-25T23:57:44Z)
- CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning [4.168157981135698]
We propose a search method that accepts any English text query as input to retrieve relevant gameplay videos.
Our approach does not rely on any external information (such as video metadata).
An example application of our approach is as a gameplay video search engine to aid in reproducing video game bugs.
arXiv Detail & Related papers (2022-03-21T16:23:02Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
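As an illustration of the frame-selection gating idea in the entry above, here is a minimal PyTorch sketch that scores each frame feature against a question representation and pools frames with a soft gate. It is a toy module built on assumed feature shapes, not the authors' architecture.

```python
# Toy frame-selection gate: weight frame features by question relevance before pooling.
import torch
import torch.nn as nn

class FrameSelectionGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # relevance of each frame to the question

    def forward(self, frame_feats: torch.Tensor, question_feat: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim); question_feat: (batch, dim)
        q = question_feat.unsqueeze(1).expand_as(frame_feats)
        gate = torch.sigmoid(self.score(torch.cat([frame_feats, q], dim=-1)))  # (batch, num_frames, 1)
        gated = gate * frame_feats
        # Gate-weighted average over frames -> one video-level feature per clip.
        return gated.sum(dim=1) / gate.sum(dim=1).clamp(min=1e-6)

# Toy usage with random features.
gate = FrameSelectionGate(dim=512)
video = torch.randn(2, 16, 512)   # 2 clips, 16 frames each
question = torch.randn(2, 512)
pooled = gate(video, question)
print(pooled.shape)               # torch.Size([2, 512])
```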
This list is automatically generated from the titles and abstracts of the papers on this site.