PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
- URL: http://arxiv.org/abs/2412.01800v1
- Date: Mon, 02 Dec 2024 18:47:25 GMT
- Title: PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
- Authors: Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, Xiaodan Liang
- Abstract summary: We propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos.
Our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts.
Based on the suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM.
- Score: 66.09921831504238
- Abstract: Recent advancements in video-based large language models (Video LLMs) have witnessed the emergence of diverse capabilities to reason about and interpret dynamic visual content. Among them, gameplay videos stand out as a distinctive data source, often containing glitches that defy physics commonsense. This characteristic renders them an effective benchmark for assessing the under-explored capability of physical commonsense understanding in video LLMs. In this paper, we propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos. PhysGame comprises 880 videos associated with glitches spanning four fundamental domains (i.e., mechanics, kinematics, optics, and material properties) and 12 distinct categories of physical commonsense. Through extensively evaluating various state-of-the-art video LLMs, our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts. To bridge this gap, we curate an instruction tuning dataset, PhysInstruct, with 140,057 question-answering pairs to facilitate physical commonsense learning. In addition, we propose a preference optimization dataset, PhysDPO, with 34,358 training pairs, where the dis-preferred responses are generated conditioned on misleading titles (i.e., meta information hacking), fewer frames (i.e., temporal hacking), and lower spatial resolutions (i.e., spatial hacking). Based on this suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM. Extensive experiments on both the physics-oriented benchmark PhysGame and general video understanding benchmarks demonstrate the state-of-the-art performance of PhysVLM.
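The PhysDPO construction described in the abstract can be made concrete with a small sketch. The snippet below is a minimal illustration (not the authors' released pipeline) of how a dis-preferred response might be produced by answering the same question under deliberately degraded context; the `answer` and `sample_frames` callables, and the specific frame counts and resolutions, are hypothetical placeholders for a generic video-LLM stack.

```python
# Minimal sketch of PhysDPO-style preference-pair construction, based only on the
# abstract's description. `answer` and `sample_frames` stand in for a generic
# video-LLM interface; the frame counts and resolutions below are illustrative
# assumptions, not values from the paper.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DPOPair:
    question: str
    chosen: str    # preferred response: full temporal/spatial context, no injected title
    rejected: str  # dis-preferred response: generated under "hacked" context


def build_physdpo_pair(
    answer: Callable[[Sequence, str], str],              # (frames, prompt) -> response text
    sample_frames: Callable[[str, int, int], Sequence],  # (video_path, num_frames, resolution) -> frames
    video_path: str,
    question: str,
    misleading_title: str,
) -> DPOPair:
    # Preferred response: densely sampled, high-resolution frames and the plain question.
    chosen = answer(sample_frames(video_path, 32, 448), question)

    # Dis-preferred response combines the three degradations named in the abstract:
    # meta information hacking (misleading title), temporal hacking (fewer frames),
    # and spatial hacking (lower resolution).
    hacked_prompt = f"Video title: {misleading_title}\n{question}"
    rejected = answer(sample_frames(video_path, 4, 112), hacked_prompt)

    return DPOPair(question=question, chosen=chosen, rejected=rejected)
```

Pairs of this form can then be fed to a standard DPO objective, which pushes the model toward the response grounded in full visual evidence and away from the one conditioned on misleading metadata and impoverished frames.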
Related papers
- UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models [39.917074900737575]
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks.
The domain of physics reasoning presents unique challenges that have received significantly less attention.
Existing benchmarks often fall short in evaluating LLMs' abilities on the breadth and depth of undergraduate-level physics.
arXiv Detail & Related papers (2025-02-01T06:42:02Z)
- Do generative video models learn physical principles from watching videos? [15.534227431706773]
AI video generation is undergoing a revolution, with quality and realism advancing rapidly.
Do video models learn "world models" that discover laws of physics, or are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality?
We address this question by developing Physics-IQ, a benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles.
arXiv Detail & Related papers (2025-01-14T20:59:37Z)
- VideoPhy: Evaluating Physical Commonsense for Video Generation [93.28748850301949]
We present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities.
We then generate videos conditioned on captions from diverse state-of-the-art text-to-video generative models.
Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts.
arXiv Detail & Related papers (2024-06-05T17:53:55Z)
- DreamPhysics: Learning Physics-Based 3D Dynamics with Video Diffusion Priors [75.83647027123119]
We propose to learn the physical properties of a material field with video diffusion priors.
We then utilize a physics-based Material-Point-Method simulator to generate 4D content with realistic motions.
arXiv Detail & Related papers (2024-06-03T16:05:25Z)
- CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning [4.168157981135698]
We propose a search method that accepts any English text query as input to retrieve relevant gameplay videos.
Our approach does not rely on any external information (such as video metadata).
An example application of our approach is as a gameplay video search engine to aid in reproducing video game bugs; a minimal retrieval sketch in this spirit appears after this list.
arXiv Detail & Related papers (2022-03-21T16:23:02Z)
- Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z)
- Occlusion resistant learning of intuitive physics from videos [52.25308231683798]
A key ability for artificial systems is to understand physical interactions between objects and to predict future outcomes of a situation.
This ability, often referred to as intuitive physics, has recently received attention, and several methods have been proposed to learn these physical rules from video sequences.
arXiv Detail & Related papers (2020-04-30T19:35:54Z)
- Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects [79.351446087227]
We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects.
Specifically, we use a simulator to predict effects and enforce that estimated forces must lead to the same effect as depicted in the video.
arXiv Detail & Related papers (2020-03-26T17:20:23Z)
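As a rough illustration of the zero-shot retrieval idea in the CLIP meets GamePhysics entry above, the sketch below scores gameplay videos against a free-form English query by comparing CLIP embeddings of sampled frames with the query's text embedding. The frame-sampling and max-pooling choices are assumptions for this sketch, not necessarily the paper's exact aggregation.

```python
# Hedged sketch of zero-shot text-to-gameplay-video retrieval with off-the-shelf CLIP.
# Frame sampling and max-over-frames pooling are assumptions made for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def score_video(query: str, frames: list[Image.Image]) -> float:
    """Return a query-video similarity: max cosine similarity over sampled frames."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_embs = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    sims = image_embs @ text_emb.T   # (num_frames, 1) cosine similarities
    return sims.max().item()         # a glitch may appear in only a few frames, so take the max
```

Ranking a corpus of videos (each represented by a handful of sampled frames) by this score and returning the top matches yields a simple gameplay-bug search engine of the kind described in that entry.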