Fugu-MT Paper Translation (Abstract): DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation

Paper overview: DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation

arxiv url: http://arxiv.org/abs/2602.22896v3
Date: Tue, 17 Mar 2026 07:08:40 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.734311
Title: DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation
Title (reference translation): DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer Skipping for Robot Manipulation
Authors: Zebin Yang, Yijiahao Qi, Tong Xie, Bo Yu, Shaoshan Liu, Meng Li
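Abstract summary: This paper proposes DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. Experiments indicate that DySL-VLA achieves a 2.1% improvement in success length over Deer-VLA on the Calvin dataset.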
Reference score (independently computed attention score): 7.958222488148539
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks like manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are consistently executed, and incremental layers, which can be selectively skipped. To intelligently skip layers without sacrificing accuracy, we invent a prior-post skipping guidance mechanism to determine when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments indicate that DySL-VLA achieves 2.1% improvement in success length over Deer-VLA on the Calvin dataset, while simultaneously reducing trainable parameters by a factor of 85.7 and providing a 3.75x speedup relative to the RoboFlamingo baseline at iso-accuracy. Our code is available on https://github.com/PKU-SEC-Lab/DYSL_VLA.
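Abstract (reference translation): Vision-Language-Action (VLA) models have achieved remarkable success in robotic tasks such as manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. Actions within a task vary in importance: critical steps demand high precision, while less important steps can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are always executed, and incremental layers, which can be selectively skipped. To skip layers intelligently without sacrificing accuracy, we invent a prior-post skipping guidance mechanism that determines when to start skipping layers. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Experiments indicate that DySL-VLA achieves a 2.1% improvement in success length over Deer-VLA on the Calvin dataset, while reducing trainable parameters by a factor of 85.7 and providing a 3.75x speedup over the RoboFlamingo baseline at the same accuracy. Our code is available at https://github.com/PKU-SEC-Lab/DYSL_VLA.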
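
The split between always-executed informative layers and skippable incremental layers can be illustrated with a small sketch. This is not the authors' implementation: the toy residual layers, the `importance` score, the fixed `threshold`, and the names `make_residual_layer` and `run_with_skipping` are all assumptions made for illustration. In particular, the abstract does not say how the prior-post skipping guidance computes its decision, so the sketch simply takes an importance value as input.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]


def make_residual_layer(scale: float) -> Layer:
    """Toy residual layer (x + scale * x); a stand-in for a real VLA block."""
    def layer(x: List[float]) -> List[float]:
        return [v + scale * v for v in x]
    return layer


def run_with_skipping(
    x: List[float],
    layers: Sequence[Layer],
    informative: Sequence[bool],
    importance: float,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Tuple[List[float], int]:
    """Run informative layers always; run incremental layers only when the
    (hypothetical) importance score reaches the threshold.

    Returns the output and the number of layers actually executed.
    """
    executed = 0
    for layer, is_informative in zip(layers, informative):
        if is_informative or importance >= threshold:
            x = layer(x)
            executed += 1
        # Otherwise the incremental layer is skipped and x passes through.
    return x, executed


# Six toy layers; the mask marks which ones are treated as informative.
layers = [make_residual_layer(0.1) for _ in range(6)]
mask = [True, False, False, True, False, True]

_, n_critical = run_with_skipping([1.0], layers, mask, importance=0.9)
_, n_routine = run_with_skipping([1.0], layers, mask, importance=0.1)
print(n_critical, n_routine)
```

In this toy setup a high-importance action runs all six layers, while a low-importance action runs only the three informative ones, which is where the compute saving would come from.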

Related papers