Fugu-MT Paper Translation (Abstract): ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

Paper overview: ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2602.17951v1
Date: Fri, 20 Feb 2026 03:06:22 GMT
Status: Translation complete
Last updated in system: 2026-02-23 18:01:41.211392
Title: ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
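Title (reference translation): ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
Authors: Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding, Zheyu Shen, Ziyao Wang, Yexiao He, Ang Li
Abstract summary: Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. This paper introduces ROCKET, a residual-oriented multi-layer alignment framework. ROCKET uses a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model.
Reference score (independently computed attention metric): 12.221605970492645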
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving 98.5% state-of-the-art success rate on LIBERO. We further demonstrate the superior performance of ROCKET across LIBERO-Plus and RoboTwin, as well as multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.
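Abstract (reference translation): Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, in which a strong vision foundation model guides a 2D VLA model. Existing methods, however, usually apply supervision at a single layer only and therefore fail to fully exploit the rich information distributed across depth, while naïve multi-layer alignment can cause gradient interference. This paper introduces ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Specifically, ROCKET uses a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model through a layer-invariant mapping, which reduces gradient conflicts. The authors provide both theoretical justification and empirical analysis showing that a shared projector is sufficient and outperforms prior designs, and they propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving a state-of-the-art 98.5% success rate on LIBERO. The paper further demonstrates the superior performance of ROCKET on LIBERO-Plus and RoboTwin, and across multiple VLA models. The code and model weights are available at https://github.com/CASE-Lab-UMD/ROCKET-VLA.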
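
The shared-projector idea described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the cosine-based loss, the use of one feature vector per layer, the layer pairing, and every name below (`linear`, `cosine`, `shared_projector_alignment_loss`) are assumptions made only for illustration. The single point it shows is that one projector matrix is reused for every aligned layer pair instead of training a separate projector per layer.

```python
import math
import random


def linear(W, x):
    """Apply weight matrix W (one row per output dimension) to vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine(a, b):
    """Cosine similarity between two vectors, with a small epsilon for stability."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def shared_projector_alignment_loss(W, student_layers, teacher_layers, layer_pairs):
    """Mean (1 - cosine similarity) over the selected layer pairs.

    W              -- the single projector shared by all layer pairs (assumption:
                      a plain linear map; the paper's projector may differ)
    student_layers -- hypothetical per-layer features from the VLA backbone
    teacher_layers -- hypothetical per-layer features from the 3D vision model
    layer_pairs    -- (student_index, teacher_index) pairs to align
    """
    losses = []
    for s_idx, t_idx in layer_pairs:
        projected = linear(W, student_layers[s_idx])  # same W for every pair
        losses.append(1.0 - cosine(projected, teacher_layers[t_idx]))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    random.seed(0)
    dim_student, dim_teacher = 4, 3
    # Toy features: 3 student layers and 2 teacher layers, one vector each.
    student = [[random.gauss(0, 1) for _ in range(dim_student)] for _ in range(3)]
    teacher = [[random.gauss(0, 1) for _ in range(dim_teacher)] for _ in range(2)]
    W = [[random.gauss(0, 1) for _ in range(dim_student)] for _ in range(dim_teacher)]
    print(shared_projector_alignment_loss(W, student, teacher, [(0, 0), (2, 1)]))
```

In a real training setup the loss would be computed over token-level features and minimized with respect to both the projector and the VLA backbone; the sketch only evaluates the loss once on toy vectors.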
