Fugu-MT paper translation (summary): PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

Paper summary: PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

arxiv url: http://arxiv.org/abs/2604.24443v1
Date: Mon, 27 Apr 2026 13:10:52 GMT
Status: translation complete
Last updated in system: 2026-04-28 17:12:08.014069
Title: PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
Title (reference translation): PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Models
Authors: Sinin Zhang, Yunfei Xie, Yuxuan Cheng, Haoyu Zhang, Tong Zhang,
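Abstract summary: We propose PhysNote, an agentic framework that enables vision-language models to externalize and refine physical knowledge through self-generated "Knowledge Notes." PhysNote stabilizes dynamic perception through spatio-temporal canonicalization, organizes self-generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge.
Reference score (attention level, computed independently by this site): 18.848171847916696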
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental challenges underlying these failures: (1) spatio-temporal identity drift, where objects lose their physical identity across successive frames and break causal chains, and (2) volatility of inference-time insights, where a model may occasionally produce correct physical reasoning but never consolidates it for future reuse. To address these challenges, we propose PhysNote, an agentic framework that enables VLMs to externalize and refine physical knowledge through self-generated "Knowledge Notes." PhysNote stabilizes dynamic perception through spatio-temporal canonicalization, organizes self-generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge. Experiments on PhysBench demonstrate that PhysNote achieves 56.68% overall accuracy, a 4.96% improvement over the best multi-agent baseline, with consistent gains across all four physical reasoning domains.
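Abstract (reference translation): Vision-Language Models (VLMs) have shown strong performance on textbook-style physics problems, but they often fail when faced with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. Two fundamental challenges lie behind these failures: (1) spatio-temporal identity drift, in which objects lose their physical identity across successive frames and break causal chains, and (2) volatility of inference-time insights, in which a model sometimes produces correct physical reasoning but never consolidates it for future reuse. To address these challenges, we propose PhysNote, an agentic framework that lets VLMs externalize and refine physical knowledge through self-generated "Knowledge Notes." PhysNote stabilizes dynamic perception through spatio-temporal canonicalization, organizes self-generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge. In experiments on PhysBench, PhysNote reaches 56.68% overall accuracy, a 4.96% improvement over the best multi-agent baseline, with consistent gains across all four physical reasoning domains.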
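
The abstract describes a loop that proposes a hypothesis, grounds it in visual evidence, and only then consolidates it into a hierarchical knowledge repository. The Python sketch below illustrates that general propose, verify, consolidate pattern. It is not the authors' implementation: the names `KnowledgeNote`, `NoteRepository`, and `reasoning_loop`, the two-level domain/topic layout, the `max_rounds` limit, and the `propose` and `verify` callables (stand-ins for a VLM call and a visual-evidence check) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class KnowledgeNote:
    """One self-generated insight (hypothetical structure, not from the paper)."""
    domain: str
    topic: str
    text: str


@dataclass
class NoteRepository:
    """Assumed two-level hierarchy: domain -> topic -> list of notes."""
    notes: Dict[str, Dict[str, List[KnowledgeNote]]] = field(default_factory=dict)

    def add(self, note: KnowledgeNote) -> None:
        self.notes.setdefault(note.domain, {}).setdefault(note.topic, []).append(note)

    def lookup(self, domain: str, topic: str) -> List[KnowledgeNote]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self.notes.get(domain, {}).get(topic, []))


def reasoning_loop(
    propose: Callable[[List[KnowledgeNote], List[str]], str],
    verify: Callable[[str], bool],
    repo: NoteRepository,
    domain: str,
    topic: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Propose a hypothesis, check it, and store it only if the check passes.

    `propose` stands in for a VLM call that sees prior notes and rejected
    hypotheses; `verify` stands in for grounding in visual evidence.
    """
    rejected: List[str] = []
    for _ in range(max_rounds):
        hypothesis = propose(repo.lookup(domain, topic), rejected)
        if verify(hypothesis):
            # Consolidate only verified knowledge for later reuse.
            repo.add(KnowledgeNote(domain, topic, hypothesis))
            return hypothesis
        rejected.append(hypothesis)
    return None  # No hypothesis survived verification.
```

With stub callables, a rejected first hypothesis leaves the repository untouched, and only the verified second one is stored. Keeping verification ahead of storage is the design point this sketch is meant to show: unverified guesses never reach the notes that later rounds read.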
