Fugu-MT Paper Translation (Abstract): Grounding Video Reasoning in Physical Signals

Paper overview: Grounding Video Reasoning in Physical Signals

arxiv url: http://arxiv.org/abs/2604.21873v1
Date: Thu, 23 Apr 2026 17:17:18 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.780562
Title: Grounding Video Reasoning in Physical Signals
Title (reference translation): Grounding Video Reasoning in Physical Signals
Authors: Alibay Osmanli, Zixu Cheng, Shaogang Gong
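Abstract summary: We introduce a grounded benchmark for physical video understanding. The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Across models and prompt families, physics remains the strongest regime overall.
Reference score (independently computed attention score): 22.667135960633697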
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what-when-where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest across settings. These results suggest that video Q&A reasoning benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.
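Abstract (reference translation): Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We propose a grounded benchmark for physical video understanding that extends the evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, neutral_rstr), and four input conditions (original, shuffled, ablated, frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic, family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest across settings. These results suggest that video Q&A reasoning benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.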
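
The abstract describes converting each clip into one shared grounded event record and deriving three query families from it, with temporal and spatial targets held constant across families. The Python sketch below illustrates that idea only. The record fields, the neutral_rstr template string, and the function names are hypothetical, since the abstract does not give the paper's actual schema or prompts; only the family and condition names come from the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import product

# Names taken from the abstract.
PROMPT_FAMILIES = ("physics", "vstar_like", "neutral_rstr")
INPUT_CONDITIONS = ("original", "shuffled", "ablated", "frame-masked")


@dataclass(frozen=True)
class GroundedEventRecord:
    """Hypothetical shared record; field names are illustrative assumptions."""
    clip_id: str
    source: str                        # e.g. "SSV2"
    physics_label: str                 # e.g. "pouring"
    action_text: str                   # free-text description of the event
    when: tuple[float, float]          # (start_s, end_s)
    where: tuple[int, int, int, int]   # bounding box (x1, y1, x2, y2)


def derive_queries(rec: GroundedEventRecord) -> dict[str, dict]:
    """Derive one target set per prompt family from a single record.

    The when/where targets are copied unchanged into every family;
    only the semantic a_what target differs between families.
    """
    a_what = {
        "physics": rec.physics_label,
        "vstar_like": rec.action_text,
        "neutral_rstr": f"event:{rec.physics_label}",  # made-up template
    }
    return {
        family: {
            "a_what": a_what[family],
            "a_when": rec.when,
            "a_where": rec.where,
        }
        for family in PROMPT_FAMILIES
    }


def evaluation_settings() -> list[tuple[str, str]]:
    """All (prompt family, input condition) pairs: 3 x 4 = 12."""
    return list(product(PROMPT_FAMILIES, INPUT_CONDITIONS))
```

Because the derivation is a pure function of the record, differences in when/where scores between families can be attributed to the prompt wording rather than to different targets. Whether every clip is actually evaluated under all 12 settings is not stated in the abstract.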
