Fugu-MT Paper Translation (Abstract): From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

Paper abstract: From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

arxiv url: http://arxiv.org/abs/2603.26839v1
Date: Fri, 27 Mar 2026 08:10:05 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:44.659013
Title: From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
Title (reference translation): From Pixels to BFS: High Maze Accuracy Does Not Amount to Visual Planning
Authors: Alberto G. Rodriguez Salgado
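Abstract summary: MazeBench is a benchmark of 110 procedurally generated maze images across nine controlled groups. The paper evaluates 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. MazeBench shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
Reference score (independently computed attention score): 0.0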
License: http://creativecommons.org/licenses/by/4.0/
Abstract: How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710 to 22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2 to 12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
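Abstract (reference translation): How do multimodal models solve visual spatial tasks: by genuine planning, or by brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro solves 79%, but these scores are misleading: models typically convert the image into a text grid and then enumerate paths step by step, consuming 1,710 to 22,818 tokens per solve. Without an added reasoning budget, all configurations score only 2 to 12%, and on 20×20 ultra-hard mazes they reach the token limit and fail. Qualitative traces show a two-stage strategy: image-to-grid conversion followed by token-level search, in effect BFS. In a text-grid ablation, Claude Sonnet 4.6 rises from 6% on images to 80% when given the correct grid, which separates weak visual extraction from downstream search. Even when explicitly instructed not to construct a grid or perform graph search, models revert to the same enumeration strategy. High accuracy on visual planning tasks therefore does not imply human-like spatial understanding.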
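
To make the second stage of the strategy described in the abstract concrete, the following is a minimal sketch of breadth-first search over a text grid. It is an illustration only, not the paper's code or MazeBench's format: the function name `solve_maze` and the grid encoding (`S` for start, `E` for exit, `#` for wall, `.` for open cell) are assumptions introduced here.

```python
from collections import deque


def solve_maze(grid):
    """Return the shortest path from 'S' to 'E' as a list of (row, col) cells,
    or None if no path exists. The grid encoding is hypothetical:
    'S' start, 'E' exit, '#' wall, any other character is an open cell."""
    start = end = None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "S":
                start = (r, c)
            elif ch == "E":
                end = (r, c)
    if start is None or end is None:
        return None

    # parent maps each visited cell to the cell it was reached from.
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == end:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if in_bounds and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# A small hypothetical grid with one dead end at (0, 1).
GRID = [
    "S.#",
    ".##",
    "..E",
]

if __name__ == "__main__":
    print(solve_maze(GRID))
```

Once the grid is correct, this search is mechanical, which is consistent with the abstract's text-grid ablation: the hard part for the models appears to be extracting the grid from the image, not the search itself.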


Related papers