Fugu-MT Paper Translation (Abstract): ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

Paper overview: ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

arxiv url: http://arxiv.org/abs/2604.24300v2
Date: Tue, 05 May 2026 23:43:25 GMT
Status: Translation complete
System update date: 2026-05-07 15:17:35.448361
Title: ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
Authors: Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen, Angel X. Chang
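Abstract summary: Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. This paper introduces ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model's actual inputs.
Reference score (attention score computed independently by this site): 59.558706734431276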
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model's actual inputs. To this end, we re-annotate objects and geometry across 381 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D annotation tools. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.

Related papers