Fugu-MT paper translation (abstract): Do multimodal models imagine electric sheep?

Paper overview: Do multimodal models imagine electric sheep?

arxiv url: http://arxiv.org/abs/2605.09693v1
Date: Sun, 10 May 2026 18:25:52 GMT
Status: translation complete
Last updated in system: 2026-05-12 23:28:50.375046
Title: Do multimodal models imagine electric sheep?
Title (reference translation): Do multimodal models imagine electric sheep?
Authors: Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun
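Abstract summary: The authors fine-tune a Qwen3.5 VLM to solve twelve visual reasoning tasks. They show that the model's activations after each action encode meaningful visual information about the intermediate state. Integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%.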
Reference score (independently computed attention metric): 99.83000217195644
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.
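Abstract (reference translation): Yes. Large multimodal models develop mental imagery when solving spatial puzzles, and they imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM on twelve visual reasoning tasks, including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour, that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions that solves a puzzle from its initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that, without any explicit visual supervision, an imperfect visual world model begins to form as a byproduct of learning to select correct actions. We therefore propose two ways to sharpen and use the mental images formed by the model. Integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%.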
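
The abstract's claim that activations "encode meaningful visual information about the intermediate state" is the kind of statement usually checked with a probe: a small model trained to read the state back out of the hidden activations. The sketch below is only an illustration of that general idea and is not the authors' method. It uses synthetic data in place of real model activations, and every name in it (`make_synthetic_data`, `fit_linear_probe`, `probe_accuracy`, the 16-cell binary state, the 64-dimensional activations) is a hypothetical choice made for the example.

```python
import numpy as np


def make_synthetic_data(n, state_dim=16, hidden_dim=64, noise=0.1, seed=0):
    """Build toy (activation, state) pairs.

    ASSUMPTION: each "intermediate state" is a binary grid flattened to
    state_dim cells, and the "activations" are a fixed random linear
    mixture of that state plus Gaussian noise. Real model activations
    would replace this function entirely.
    """
    rng = np.random.default_rng(seed)
    states = rng.integers(0, 2, size=(n, state_dim)).astype(np.float64)
    mixing = rng.normal(size=(state_dim, hidden_dim))
    acts = states @ mixing + noise * rng.normal(size=(n, hidden_dim))
    return acts, states


def fit_linear_probe(acts, targets, l2=1e-2):
    """Fit a ridge-regression probe by solving (A^T A + l2 I) W = A^T Y."""
    hidden_dim = acts.shape[1]
    gram = acts.T @ acts + l2 * np.eye(hidden_dim)
    return np.linalg.solve(gram, acts.T @ targets)


def probe_accuracy(weights, acts, targets):
    """Fraction of state cells recovered after thresholding at 0.5."""
    predicted = (acts @ weights) > 0.5
    return float((predicted == (targets > 0.5)).mean())


if __name__ == "__main__":
    acts, states = make_synthetic_data(600)
    # Fit on the first 500 examples and evaluate on the held-out 100.
    weights = fit_linear_probe(acts[:500], states[:500])
    print("held-out cell accuracy:", probe_accuracy(weights, acts[500:], states[500:]))
```

In a probing study of this kind, high held-out accuracy is normally compared against a control, such as a probe trained on shuffled targets, before concluding that the information is present in the activations and not just memorized by the probe.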

Related papers