From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models
- URL: http://arxiv.org/abs/2508.10770v1
- Date: Thu, 14 Aug 2025 15:55:48 GMT
- Title: From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models
- Authors: Tiancheng Han, Yunfei Gao, Yong Li, Wuzhou Yu, Qiaosheng Zhang, Wenqi Shao
- Abstract summary: Physical reasoning is a critical step towards building robust world models. Recent vision language models (VLMs) have shown remarkable progress in specialized domains, but their capability for physical reasoning remains largely unexplored.
- Score: 10.740632493925018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatio-physical reasoning, a foundational capability for understanding the real physical world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human-like priors and a lack of deep reasoning. To address these challenges, we apply supervised fine-tuning followed by rule-based reinforcement learning to Qwen2.5-VL-7B, resulting in significant improvements in spatio-physical reasoning capabilities and surpassing leading proprietary models. Nevertheless, despite this success, the model's generalization to new physics scenarios remains limited -- underscoring the pressing need for new approaches in spatio-physical reasoning.
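The "rule-based reinforcement learning" mentioned in the abstract hinges on a reward computed from verifiable rules rather than a learned reward model. The paper does not publish its reward function; the sketch below is only a minimal illustration of the general pattern (the function name, tag format, and reward weights are hypothetical, not taken from the paper):

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based reward of the kind used in RL fine-tuning:
    a small bonus for well-formed reasoning, plus a larger accuracy
    reward when the final answer matches the reference."""
    reward = 0.0
    # Format rule: reward responses that wrap reasoning in <think> tags.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.1
    # Accuracy rule: compare the text after "Answer:" with the reference.
    match = re.search(r"Answer:\s*(.+)", response)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0
    return reward

resp = "<think>The ball falls under gravity.</think>\nAnswer: B"
print(rule_based_reward(resp, "B"))
```

Because the reward is computed by deterministic rules, it cannot be gamed the way a learned reward model can, which is one reason this recipe is popular for reasoning-focused fine-tuning.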
Related papers
- P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads [91.05736019384489]
We introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our flagship P1-VL-235B-A22B becomes the first open-source vision-language model to secure 12 gold medals and achieves state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2026-02-10T06:28:08Z) - PAI-Bench: A Comprehensive Benchmark For Physical AI [70.22914615084215]
Video generative models often struggle to maintain physically coherent dynamics. Multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI.
arXiv Detail & Related papers (2025-12-01T18:47:39Z) - LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigating model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z) - Why Cannot Neural Networks Master Extrapolation? Insights from Physical Laws [0.0]
Motivated by the remarkable success of Foundation Models (FMs) in language modeling, there has been growing interest in developing FMs for time series prediction. This work identifies and formalizes a fundamental property characterizing the ability of statistical learning models to predict more accurately outside of their training domain. In addition to a theoretical analysis, we present empirical results showcasing the implications of this property on current deep learning architectures.
arXiv Detail & Related papers (2025-10-05T09:07:25Z) - Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models [0.523693719989689]
We introduce a novel framework designed to rigorously evaluate Vision-Language Models (VLMs) on their understanding of 2D physics. Our framework features a pragmatic scenario generator that creates a diverse testbed of over 400 problems across four core domains: Projectile Motion, Collision Dynamics, Mechanics, and Fluid Dynamics. We demonstrate a strong correlation between model scale and reasoning ability, with our top-performing model, Qwen2.5-VL-7B, achieving an overall score of 0.815.
arXiv Detail & Related papers (2025-09-10T04:15:01Z) - Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs [12.215295420714787]
This study investigates the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation.
arXiv Detail & Related papers (2025-07-02T03:51:16Z) - Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction. We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps. We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
arXiv Detail & Related papers (2025-06-27T09:53:57Z) - Understanding Overadaptation in Supervised Fine-Tuning: The Role of Ensemble Methods [11.695512384798299]
Supervised fine-tuning is the dominant approach for adapting foundation models to specialized tasks. In vision models, ensembling a pretrained model with its fine-tuned counterpart has been shown to mitigate this issue. We observe an overadaptation phenomenon: the ensemble model not only retains general knowledge from the foundation model but also outperforms the fine-tuned model even on the fine-tuning domain itself.
arXiv Detail & Related papers (2025-06-02T17:23:16Z) - Evaluating the Logical Reasoning Abilities of Large Reasoning Models [15.009205651973666]
We introduce LogiEval, a benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance. Our analysis reveals that human performance does not mirror model failure distributions.
arXiv Detail & Related papers (2025-05-17T05:36:14Z) - Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes [84.1059652774853]
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. Recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world.
arXiv Detail & Related papers (2025-04-21T11:48:39Z) - A Survey on Post-training of Large Language Models [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. These challenges necessitate advanced post-training language models (PoLMs) to address shortcomings, such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Integration and Adaptation, which
arXiv Detail & Related papers (2025-03-08T05:41:42Z) - Low-Rank Adaptation for Foundation Models: A Comprehensive Review [56.341827242332194]
Low-Rank Adaptation (LoRA) has emerged as a highly promising approach for mitigating these challenges. This survey provides the first comprehensive review of LoRA techniques, extending beyond Large Language Models to general foundation models.
arXiv Detail & Related papers (2024-12-31T09:38:55Z) - Large Language Models for Forecasting and Anomaly Detection: A Systematic Literature Review [10.325003320290547]
This systematic literature review comprehensively examines the application of Large Language Models (LLMs) in forecasting and anomaly detection.
LLMs have demonstrated significant potential in parsing and analyzing extensive datasets to identify patterns, predict future events, and detect anomalous behavior across various domains.
This review identifies several critical challenges that impede their broader adoption and effectiveness, including the reliance on vast historical datasets, issues with generalizability across different contexts, and the phenomenon of model hallucinations.
arXiv Detail & Related papers (2024-02-15T22:43:02Z) - The Essential Role of Causality in Foundation World Models for Embodied AI [102.75402420915965]
Embodied AI agents will require the ability to perform new tasks in many different real-world environments.
Current foundation models fail to accurately model physical interactions and are therefore insufficient for Embodied AI.
The study of causality lends itself to the construction of veridical world models.
arXiv Detail & Related papers (2024-02-06T17:15:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.