Fugu-MT Paper Translation (Abstract): Grounded World Model for Semantically Generalizable Planning

Paper overview: Grounded World Model for Semantically Generalizable Planning

arxiv url: http://arxiv.org/abs/2604.11751v1
Date: Mon, 13 Apr 2026 17:25:41 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.715754
Title: Grounded World Model for Semantically Generalizable Planning
Authors: Quanyi Li, Lan Feng, Haonan Zhang, Wuyang Li, Letian Wang, Alexandre Alahi, Harold Soh
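Abstract summary: We learn a Grounded World Model (GWM) in a vision-language-aligned latent space. Each proposed action is scored based on how close its future outcome is to the task instruction. On the proposed WISER benchmark, GWM-MPC achieves an 87% success rate on the test set.
Reference score (independently computed attention score): 94.53923128709965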
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder such as DINO or JEPA. However, it is challenging to obtain the goal image before task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its predicted future outcome is to the task instruction, as measured by the similarity of their embeddings. This approach transforms visuomotor MPC into a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves an 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.
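The scoring step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `select_action` and `toy_world_model`, the two-dimensional latent vectors, and the use of cosine similarity as the embedding similarity are all assumptions made for illustration. The abstract only states that each action proposal is scored by how close its predicted outcome is to the task instruction in a shared vision-language latent space.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_action(
    current_latent: Vector,
    instruction_embedding: Vector,
    proposals: Sequence[Vector],
    world_model: Callable[[Vector, Vector], Vector],
) -> Tuple[int, float]:
    """Pick the proposal whose predicted outcome is most similar to the instruction.

    `world_model` is a hypothetical stand-in: it maps (latent, action) to a
    predicted future latent in the same space as `instruction_embedding`.
    """
    best_index, best_score = -1, -math.inf
    for i, action in enumerate(proposals):
        predicted = world_model(current_latent, action)
        score = cosine_similarity(predicted, instruction_embedding)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


def toy_world_model(latent: Vector, action: Vector) -> Vector:
    """Toy dynamics (assumption): the action simply shifts the latent."""
    return [l + a for l, a in zip(latent, action)]


# Toy data: three action proposals, one of which moves toward the instruction.
current = [0.0, 0.0]
instruction = [1.0, 0.0]
proposals = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]

best_index, best_score = select_action(current, instruction, proposals, toy_world_model)
print(best_index, best_score)
```

In this toy setup the second proposal is selected because its predicted latent points in the same direction as the instruction embedding. In a real system the latents would come from a pretrained vision-language encoder and a learned world model rather than hand-written vectors.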
