Fugu-MT Paper Translation (Abstract): Vision Language Models Cannot Plan, but Can They Formalize?

Paper abstract: Vision Language Models Cannot Plan, but Can They Formalize?

arxiv url: http://arxiv.org/abs/2509.21576v1
Date: Thu, 25 Sep 2025 20:55:11 GMT
Status: Translation complete
System update date: 2025-09-29 20:57:54.008548
Title: Vision Language Models Cannot Plan, but Can They Formalize?
Authors: Muyu He, Yuxi Zheng, Yuchen Liu, Zijian An, Bill Cai, Jiani Huang, Lifeng Zhou, Feng Liu, Ziyang Li, Li Zhang
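Abstract summary: This paper presents five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, multimodal PDDL formalization. It identifies vision, rather than language, as the bottleneck, because VLMs often fail to capture an exhaustive set of the necessary object relations.
Reference score (attention score computed by this site): 28.52711774279781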
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The advancement of vision language models (VLMs) has empowered embodied agents to accomplish simple multimodal planning tasks, but not long-horizon ones requiring long sequences of actions. In text-only simulations, long-horizon planning has seen significant improvement brought by repositioning the role of LLMs. Instead of directly generating action sequences, LLMs translate the planning domain and problem into a formal planning language like the Planning Domain Definition Language (PDDL), which can call a formal solver to derive the plan in a verifiable manner. In multimodal environments, research on VLM-as-formalizer remains scarce, usually involving gross simplifications such as predefined object vocabulary or overly similar few-shot examples. In this work, we present a suite of five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, and multimodal PDDL formalization. We evaluate those on an existing benchmark while presenting another two that for the first time account for planning with authentic, multi-view, and low-quality images. We conclude that VLM-as-formalizer greatly outperforms end-to-end plan generation. We reveal the bottleneck to be vision rather than language, as VLMs often fail to capture an exhaustive set of necessary object relations. While generating intermediate textual representations such as captions or scene graphs partially compensates for the performance gap, their inconsistent gain leaves headroom for future research directions on multimodal planning formalization.
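The formalize-then-solve idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline and it does not parse real PDDL: it assumes a hand-written, STRIPS-style two-block problem (the fact strings, the `Action` type, and the `plan` function are all hypothetical names chosen for illustration) and a plain breadth-first search standing in for a formal solver. It also shows why a single missed object relation in the formalized initial state, the failure mode the abstract attributes to vision, can leave the solver with no plan at all.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    """A ground STRIPS-style action (hypothetical type for this sketch)."""
    name: str
    pre: frozenset     # facts that must hold before the action
    add: frozenset     # facts the action makes true
    delete: frozenset  # facts the action makes false


def act(name, pre, add, delete):
    return Action(name, frozenset(pre), frozenset(add), frozenset(delete))


def plan(init, goal, actions):
    """Breadth-first forward search; returns a shortest plan or None."""
    start, goal = frozenset(init), frozenset(goal)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for a in actions:
            if a.pre <= state:
                nxt = (state - a.delete) | a.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [a.name]))
    return None


# Hand-grounded subset of a blocks-world domain (assumed, not from the paper).
ACTIONS = [
    act("unstack b a", {"on b a", "clear b", "handempty"},
        {"holding b", "clear a"}, {"on b a", "clear b", "handempty"}),
    act("putdown b", {"holding b"},
        {"ontable b", "clear b", "handempty"}, {"holding b"}),
    act("pickup a", {"ontable a", "clear a", "handempty"},
        {"holding a"}, {"ontable a", "clear a", "handempty"}),
    act("stack a b", {"holding a", "clear b"},
        {"on a b", "clear a", "handempty"}, {"holding a", "clear b"}),
]

# The initial state a formalizer would have to read off the scene.
INIT = {"on b a", "ontable a", "clear b", "handempty"}
GOAL = {"on a b"}

full_plan = plan(INIT, GOAL, ACTIONS)
# Same scene, but the relation "clear b" was not captured.
broken_plan = plan(INIT - {"clear b"}, GOAL, ACTIONS)

print(full_plan)
print(broken_plan)
```

With the complete initial state the search finds the four-step plan; with `clear b` missing, no action is applicable and the search returns `None`. The solver itself is unchanged between the two runs, so the quality of the formalized scene description alone decides the outcome.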


Related papers