Fugu-MT Paper Translation (Abstract): Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

Paper summary: Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

arxiv url: http://arxiv.org/abs/2605.16848v1
Date: Sat, 16 May 2026 07:12:19 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:47.200849
Title: Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction
Title (reference translation): Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction
Authors: Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding,
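Abstract summary: Planning from raw visual input is a significant challenge for Vision-Language Models (VLMs). We formulate Thinking with Images (TWI) as a tool to gradually build and reflect an accurate internal world model. We propose Pattern Inference, a novel TWI strategy that enables VLMs to actively recognize known visual patterns in new tasks.
Reference score (independently computed attention score): 5.489090549883847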
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs), when the complexity of input is beyond their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well-trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks that are far beyond their initial capabilities, at the cost that too many TWI operations would significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.
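Abstract (reference translation): Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs) when the complexity of the input exceeds their one-step perception capability. A reasonable solution, inspired by recent advances in Thinking with Images (TWI), is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, although current VLMs are well trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To address this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks far beyond their initial capabilities, but at the cost of significant computational overhead when too many TWI operations are used. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy that enables VLMs to actively recognize known visual patterns in new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy that treats visual patterns as composite, reusable experts that are autonomously discovered and optimized from experience. Experimental evaluations in the FrozenLake, Crafter, and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.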

Related papers