Fugu-MT Paper Translation (Summary): PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

Paper summary: PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

arxiv url: http://arxiv.org/abs/2603.29281v1
Date: Tue, 31 Mar 2026 05:29:22 GMT
Status: Translation complete
System update date: 2026-04-01 15:25:03.172678
Title: PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models
Title (reference translation): PRISM: A Multi-View, Multi-Capability Retail Video Dataset for Embodied Vision-Language Models
Authors: Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy, Narsimha Menga, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi
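Abstract summary: We present PRISM, a 270K-sample multi-view video corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail because they do not understand space, physical dynamics, and embodied action well enough to operate reliably in the world.
Reference score (independently computed attention score): 0.0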
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics, and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP). To our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric, and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding, where accuracy improves by 36.4%. Our results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism
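Abstract (reference translation): There is a critical gap between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics, and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology spanning spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP). The corpus captures data from egocentric, exocentric, and 360° viewpoints across five supermarkets and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across the 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding, where accuracy improves by 36.4%. The results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism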
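
The scale figures quoted in the abstract (4 fps, about 11.8M frames, about 730M tokens, 270K samples) imply some per-sample averages that the abstract does not state directly. The sketch below derives them by simple division. It assumes every frame was sampled at exactly 4 fps and that the rounded totals are exact, so the outputs are rough estimates only, not figures reported by the authors. The function name `corpus_stats` is a hypothetical helper introduced for illustration.

```python
# Back-of-the-envelope estimates derived from the rounded totals in the abstract.
# Assumption: all frames are sampled at exactly 4 fps and the totals are exact.
FPS = 4                 # sampling rate stated in the abstract
TOTAL_FRAMES = 11.8e6   # "approximately 11.8M video frames"
TOTAL_TOKENS = 730e6    # "approximately 730M tokens"
NUM_SAMPLES = 270e3     # "270K-sample" corpus


def corpus_stats(frames, tokens, samples, fps):
    """Return rough aggregate and per-sample averages (hypothetical helper)."""
    total_seconds = frames / fps
    frames_per_sample = frames / samples
    return {
        "video_hours": total_seconds / 3600,
        "frames_per_sample": frames_per_sample,
        "seconds_per_sample": frames_per_sample / fps,
        "tokens_per_sample": tokens / samples,
    }


stats = corpus_stats(TOTAL_FRAMES, TOTAL_TOKENS, NUM_SAMPLES, FPS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the corpus works out to roughly 820 hours of sampled video, with an average clip of about 44 frames (about 11 seconds) and about 2,700 tokens per sample.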
