QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
- URL: http://arxiv.org/abs/2512.19526v1
- Date: Mon, 22 Dec 2025 16:18:00 GMT
- Title: QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
- Authors: Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-fei, Ehsan Adeli
- Abstract summary: QuantiPhy is the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness.
- Score: 14.860588888047708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason about physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors such as background noise, counterfactual priors, and strategic prompting, and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning about kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.
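The abstract states that prompts and scoring are standardized to assess numerical accuracy, but the exact scoring rule is not reproduced here. As a rough illustration only, a relative-error tolerance metric is one common way to score numerical kinematic estimates; the function names, the 10% tolerance, and the example values below are assumptions made for this sketch, not details taken from the paper.

```python
# Hypothetical sketch of a numerical-accuracy score for a QuantiPhy-style task.
# The tolerance threshold and function names are illustrative assumptions,
# not the benchmark's published scoring rule.
from typing import Iterable


def relative_error(pred: float, truth: float, eps: float = 1e-9) -> float:
    """|pred - truth| / |truth|, guarded against division by zero."""
    return abs(pred - truth) / max(abs(truth), eps)


def within_tolerance(pred: float, truth: float, tol: float = 0.10) -> bool:
    """Count an estimate as correct if its relative error is at most tol (e.g., 10%)."""
    return relative_error(pred, truth) <= tol


def accuracy(preds: Iterable[float], truths: Iterable[float], tol: float = 0.10) -> float:
    """Fraction of kinematic estimates (size, velocity, acceleration) within tolerance."""
    pairs = list(zip(preds, truths))
    return sum(within_tolerance(p, t, tol) for p, t in pairs) / len(pairs)


# Example: a model's velocity estimates at three timestamps vs. numerical ground truth.
print(accuracy([2.1, 9.5, 0.32], [2.0, 10.0, 0.40]))  # -> 0.666...
```

Under such a metric, a model that produces verbally plausible but numerically off estimates (as the paper reports for state-of-the-art VLMs) would score low even when its qualitative description of the motion is reasonable.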
Related papers
- FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding [3.451422886843121]
Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. We propose FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained and domain-specific understanding of anomalous videos.
arXiv Detail & Related papers (2026-01-24T02:17:07Z) - TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility [70.24211591214528]
Video generative models produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing. Existing Video-Language Models (VLMs) struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. We introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding. We propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding.
arXiv Detail & Related papers (2025-10-08T21:03:46Z) - ORBIT: An Object Property Reasoning Benchmark for Visual Inference Tasks [10.848408092385192]
We introduce a systematic evaluation framework with images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions. We instantiate this benchmark into ORBIT, a multi-level reasoning VQA benchmark for object properties comprising 360 images paired with a total of 1,080 count-based questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations compared to humans, with the best-performing model only reaching 40% accuracy.
arXiv Detail & Related papers (2025-08-14T11:28:40Z) - HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z) - Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z) - VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z) - Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models [11.282655911647483]
Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). We introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding.
arXiv Detail & Related papers (2024-12-11T18:40:16Z) - Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts [33.12056808870413]
We introduce Value-Spectrum, a novel Visual Question Answering (VQA) benchmark aimed at assessing Vision-Language Models (VLMs). We design a VLM agent pipeline to simulate video browsing and construct a vector database comprising over 50,000 short videos from TikTok, YouTube Shorts, and Instagram Reels. These videos span multiple months and cover diverse topics, including family, health, hobbies, society, technology, etc.
arXiv Detail & Related papers (2024-11-18T11:31:10Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM)
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined
Levels [95.44077384918725]
We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores.
The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
arXiv Detail & Related papers (2023-12-28T16:10:25Z) - Leveraging VLM-Based Pipelines to Annotate 3D Objects [68.51034848207355]
We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM's response.
Instead of merging text-only responses, we utilize the VLM's joint image-text likelihoods.
We show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
arXiv Detail & Related papers (2023-11-29T17:54:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.