Fugu-MT Paper Translation (Abstract): Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary

Paper overview: Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary

arxiv url: http://arxiv.org/abs/2603.11410v1
Date: Thu, 12 Mar 2026 00:52:16 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.738126
Title: Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary
Title (reference translation): Not Seeing Direction: A Cognitively Grounded Benchmark for Systematic Orientation Failures in MLLMs
Authors: Nazia Tasnim, Keanu Nichols, Yuting Yang, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, Bryan A. Plummer
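Abstract summary: Current vision-language benchmarks largely conflate orientation with position and general scene understanding. This paper introduces Discriminative Orientation Reasoning Intelligence (DORI), a hierarchical benchmark that makes object orientation the primary target. DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings.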
Reference score (independently computed attention metric): 24.852775714606224
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Humans learn object orientation progressively, from recognizing which way an object faces, to mentally rotating it, to reasoning about orientations between objects. Current vision-language benchmarks largely conflate orientation with position and general scene understanding. We introduce Discriminative Orientation Reasoning Intelligence (DORI), a cognitively grounded hierarchical benchmark that makes object orientation the primary target. Inspired by stages of human orientation cognition, DORI decomposes orientation into four dimensions, each evaluated at coarse (categorical) and granular (metric) levels. Composed from 13,652 images across 14 sources, DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings. Its coarse-to-granular design isolates orientation from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity via bounding-box isolation, standardized spatial reference frames, and structured prompts. Evaluating 24 state-of-the-art vision-language models shows a clear pattern: models that perform well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse and 45.0% on granular judgments, with largest failures on compound rotations and shifts in inter-object reference frames. Large coarse-to-granular gaps reveal reliance on categorical heuristics rather than geometric reasoning, a limitation hidden by existing benchmarks. These results identify orientation understanding as an unsolved challenge for multimodal systems, with implications for robotic manipulation, 3D scene reconstruction, and human-AI interaction.
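Abstract (reference translation): Humans learn object orientation in stages, from recognizing which way an object faces, to mentally rotating it, to reasoning about orientations between objects. Current vision-language benchmarks largely conflate orientation with position and general scene understanding. This paper introduces Discriminative Orientation Reasoning Intelligence (DORI), a cognitively grounded hierarchical benchmark that makes object orientation the primary target. Drawing on the stages of human orientation cognition, DORI decomposes orientation into four dimensions, each evaluated at a coarse (categorical) and a granular (metric) level. Built from 13,652 images across 14 sources, DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings. Its coarse-to-granular design isolates orientation from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity by means of bounding-box isolation, standardized spatial reference frames, and structured prompts. An evaluation of 24 state-of-the-art vision-language models shows a clear pattern: models that perform well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse and 45.0% on granular judgments, with the largest failures on compound rotations and on shifts in inter-object reference frames. The large coarse-to-granular gaps indicate reliance on categorical heuristics rather than geometric reasoning, a limitation hidden by existing benchmarks. These results identify orientation understanding as an unsolved challenge for multimodal systems, with implications for robotic manipulation, 3D scene reconstruction, and human-AI interaction.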

Related papers