PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
- URL: http://arxiv.org/abs/2412.01800v1
- Date: Mon, 02 Dec 2024 18:47:25 GMT
- Title: PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
- Authors: Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, Xiaodan Liang
- Abstract summary: We propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos. Our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts. Based on the suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM.
- Score: 66.09921831504238
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in video-based large language models (Video LLMs) have witnessed the emergence of diverse capabilities to reason about and interpret dynamic visual content. Among them, gameplay videos stand out as a distinctive data source, often containing glitches that defy physical commonsense. This characteristic renders them an effective benchmark for assessing the under-explored capability of physical commonsense understanding in video LLMs. In this paper, we propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos. PhysGame comprises 880 videos associated with glitches spanning four fundamental physical domains (i.e., mechanics, kinematics, optics, and material properties) and 12 distinct types of physical commonsense. Through extensive evaluation of various state-of-the-art video LLMs, our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts. To bridge this gap, we curate an instruction-tuning dataset, PhysInstruct, with 140,057 question-answering pairs to facilitate physical commonsense learning. In addition, we propose a preference optimization dataset, PhysDPO, with 34,358 training pairs, where the dis-preferred responses are generated conditioned on misleading titles (i.e., meta-information hacking), fewer frames (i.e., temporal hacking), and lower spatial resolutions (i.e., spatial hacking). Based on this suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM. Extensive experiments on both the physics-oriented benchmark PhysGame and general video understanding benchmarks demonstrate the state-of-the-art performance of PhysVLM.
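The PhysDPO construction is the most implementation-oriented idea in the abstract: dis-preferred responses are produced by deliberately degrading the model's input before generation. Below is a minimal sketch of what the three degradation strategies could look like, assuming video frames arrive as a (T, C, H, W) tensor; the function names, signatures, and default values here are illustrative assumptions, not the paper's actual code.

```python
import torch
import torchvision.transforms.functional as TF

# Hypothetical sketches of the three "hacking" strategies described in the
# abstract for generating PhysDPO's dis-preferred responses. All names and
# defaults are assumptions for illustration, not the released implementation.

def temporal_hack(frames: torch.Tensor, keep: int = 4) -> torch.Tensor:
    """Temporal hacking: keep only a few uniformly spaced frames so the
    model misses the glitch dynamics. `frames` has shape (T, C, H, W)."""
    t = frames.shape[0]
    idx = torch.linspace(0, t - 1, steps=min(keep, t)).long()
    return frames[idx]

def spatial_hack(frames: torch.Tensor, scale: float = 0.25) -> torch.Tensor:
    """Spatial hacking: downscale every frame so fine-grained physical
    cues (e.g., clipping geometry, shadow errors) are lost."""
    _, _, h, w = frames.shape
    new_size = [max(1, int(h * scale)), max(1, int(w * scale))]
    return TF.resize(frames, new_size)

def meta_hack(question: str, misleading_title: str) -> str:
    """Meta-information hacking: prepend a misleading video title so the
    model anchors on wrong textual context instead of visual evidence."""
    return f"Video title: {misleading_title}\n{question}"
```

Under this reading, the preferred response in each PhysDPO pair would come from the full-quality input, and the (preferred, dis-preferred) pair would then feed a standard direct preference optimization objective.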
Related papers
- VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior [88.51778468222766]
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos.
VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics.
We propose a novel two-stage image-to-video generation framework that explicitly incorporates physics via a vision- and language-informed physical prior.
arXiv Detail & Related papers (2025-03-30T09:03:09Z)
- VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation [66.58048825989239]
VideoPhy-2 is an action-centric dataset for evaluating physical commonsense in generated videos.
We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos.
Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance.
arXiv Detail & Related papers (2025-03-09T22:49:12Z)
- Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection [2.1013864820763755]
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge.
Phys-AD is the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection.
The dataset includes more than 6,400 videos across 22 real-world object categories, recorded as objects interact with robot arms and motors, and covers 47 types of anomalies.
arXiv Detail & Related papers (2025-03-05T14:49:08Z)
- VideoPhy: Evaluating Physical Commonsense for Video Generation [93.28748850301949]
We present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities.
We then generate videos conditioned on captions from diverse state-of-the-art text-to-video generative models.
Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts.
arXiv Detail & Related papers (2024-06-05T17:53:55Z)
- DreamPhysics: Learning Physical Properties of Dynamic 3D Gaussians with Video Diffusion Priors [75.83647027123119]
We propose to learn the physical properties of a material field with video diffusion priors.
We then utilize a physics-based Material-Point-Method simulator to generate 4D content with realistic motions.
arXiv Detail & Related papers (2024-06-03T16:05:25Z)
- CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning [4.168157981135698]
We propose a search method that accepts any English text query as input to retrieve relevant gameplay videos.
Our approach does not rely on any external information (such as video metadata).
An example application of our approach is as a gameplay video search engine to aid in reproducing video game bugs.
arXiv Detail & Related papers (2022-03-21T16:23:02Z)
- Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z)