Fugu-MT Paper Translation (Abstract): Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

Paper overview: Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

arxiv url: http://arxiv.org/abs/2606.05966v1
Date: Thu, 04 Jun 2026 10:07:05 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.713642
Title: Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs
Title (reference translation): Causal Awareness for Physical Reasoning: A Benchmark for Causally Informed Physical World Understanding in VLMs
Authors: Tianyi Tang, Zhuoyi Lin, Zeyu Feng, Tianyi Ma, Yew-Soon Ong, Ivor Tsang, Haiyan Yin
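Abstract summary: CausalPhys is a benchmark of over 3,000 carefully curated video- and image-based questions spanning four domains: Perception, Anticipation, Intervention, and Goal Orientation. Each question is paired with an expert-annotated causal graph capturing object-attribute-event dependencies, enabling interpretable and fine-grained evaluation of causal understanding. Building on this, the paper formulates a causal-graph-grounded metric that quantitatively measures how well a model's chain-of-thought reasoning aligns with the correct causal relations. The paper also proposes Causal Rationale-informed Fine-Tuning (CRFT).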
Reference score (independently computed attention score): 49.55219052565761
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding and reasoning about the physical world is the foundation of intelligent behavior, yet state-of-the-art vision-language models (VLMs) still fail at causal physical reasoning, often producing plausible but incorrect answers. To address this gap, we introduce CausalPhys, a benchmark of over 3,000 carefully curated video- and image-based questions spanning four domains: Perception, Anticipation, Intervention, and Goal Orientation. Each question is paired with an expert-annotated causal graph capturing object-attribute-event dependencies, enabling interpretable and fine-grained evaluation of causal understanding. Building on this, we formulate a causal-graph-grounded metric that quantitatively measures how well a model's chain-of-thought reasoning aligns with the correct causal relations, moving beyond answer-only accuracy and enabling systematic diagnosis of VLMs' causal reasoning failures. Using this metric, we conduct a comprehensive analysis of leading VLMs, revealing systematic gaps in capturing causal dependencies and underscoring the need for causality-aware learning. To address these limitations, we further propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structures. Extensive experiments demonstrate that CRFT substantially enhances both reasoning accuracy and interpretability across multiple model backbones. By unifying dataset curation, causal evaluation, and causality-informed learning, CausalPhys establishes a strong foundation for advancing modern VLMs toward causally grounded physical reasoning.

Related papers