Fugu-MT Paper Translation (Abstract): Key-Gram: Extensible World Knowledge for Embodied Manipulation

Paper summary: Key-Gram: Extensible World Knowledge for Embodied Manipulation

arxiv url: http://arxiv.org/abs/2605.18556v1
Date: Mon, 18 May 2026 15:37:02 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:49.9173
Title: Key-Gram: Extensible World Knowledge for Embodied Manipulation
Authors: Jingjing Fan, Siyuan Li, Botao Ren, Zhidong Deng
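Abstract summary: Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. Key-Gram is a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control.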
Reference score (independently computed attention score): 20.234321700038237
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. However, current vision-language-action policies and world-action models often couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, leading to modality competition and making knowledge extension dependent on backbone updates. In this paper, we introduce Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design allows the backbone to devote its main capacity to visual reasoning and action inference, while reusable instruction knowledge is stored in an extensible external memory. The logical memory table can be conveniently partitioned during training and, due to its $O(1)$ lookup pattern, efficiently placed on host memory during inference. Across RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation, Key-Gram consistently improves both $π_{0}$ and $π_{0.5}$ backbones, with average relative gains of $29.5\%/9.9\%$ on RoboTwin2.0, $35.8\%/4.5\%$ on LIBERO-Plus transfer without target-domain fine-tuning, and $15.4\%/8.1\%$ on real-world long-horizon tasks. These results demonstrate that externalized linguistic memory provides an effective and extensible mechanism for improving compositional grounding, transfer, and real-world manipulation.
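
The memory path described in the abstract (decompose an instruction into key-grams, look up entries by deterministic hash, inject them through a gate) can be illustrated with a minimal sketch. This is not the authors' implementation: the word n-gram splitting, the SHA-256 slot hash, the table size, the vector width, and the scalar sigmoid gate are all assumptions chosen for illustration, and the lightweight convolutional fusion step is omitted.

```python
import hashlib
import math
import random

TABLE_SIZE = 1024  # hypothetical number of memory slots
DIM = 8            # hypothetical embedding width


def key_grams(instruction, max_n=2):
    """Split an instruction into word n-grams (a stand-in for key-grams)."""
    words = instruction.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


def slot_for(gram, table_size=TABLE_SIZE):
    """Deterministic hash: the same n-gram always maps to the same slot."""
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def make_table(table_size=TABLE_SIZE, dim=DIM, seed=0):
    """Build a randomly initialised memory table (it would be learned in practice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(table_size)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inject(hidden, instruction, table):
    """Add gated memory entries to a hidden vector.

    The gate is an assumed scalar sigmoid of the dot product between the
    hidden state and the retrieved entry, so the hidden state decides how
    much of each entry is let through.
    """
    out = list(hidden)
    for gram in key_grams(instruction):
        entry = table[slot_for(gram, len(table))]  # one index operation per gram
        gate = sigmoid(sum(h * e for h, e in zip(hidden, entry)))
        for j, e in enumerate(entry):
            out[j] += gate * e
    return out
```

Under these assumptions, each lookup is a single index into the table, which is consistent with the $O(1)$ access pattern the abstract cites as the reason the table can be placed on host memory. New instruction knowledge would be added by writing table entries rather than by updating the backbone.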
