Fugu-MT Paper Translation (Summary): ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

Paper summary: ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2605.29438v1
Date: Thu, 28 May 2026 06:33:05 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.855952
Title: ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models
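Title (reference translation): ElegantVLA: Learning the Timing of Thinking for Vision-Language-Action Models
Authors: Ye Li, Huanan Liu, Kangye Ji, Yuan Meng, Jiajun Fan, Yuansong Wang, Shiyu Qin, Chenglei Wu, Shu-Tao Xia, Zhi Wang
Abstract summary: Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. ElegantVLA is a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.
Reference score (attention score computed independently by this site): 46.57405778313275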
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, treating different control steps with largely fixed computation and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on visual-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.
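The scheduling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the abstract describes a learned lightweight scheduler, whereas the code below substitutes hand-set thresholds purely for illustration. All names (`SchedulerInput`, `select_modes`, `motion_change`), the threshold values, the `goal_window` rule, and the numbering of modes are assumptions; only the five-level Vision-LLM mode, the three-level denoising mode, and the three observed signals come from the abstract.

```python
import math
from dataclasses import dataclass


@dataclass
class SchedulerInput:
    # All three fields are hypothetical encodings of the signals named in the abstract.
    rep_similarity: float  # similarity between current and cached representation, in [-1, 1]
    motion_change: float   # normalized change in robot motion since the last step, 0 = steady
    progress: float        # fraction of the episode elapsed, in [0, 1]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_modes(obs, sim_thresholds=(0.80, 0.90, 0.95, 0.98), goal_window=0.85):
    """Return (vision_llm_mode, denoise_mode) for one control step.

    vision_llm_mode: 0 = full recomputation ... 4 = most aggressive temporal reuse.
    denoise_mode:    0 = full refinement, 1 = partial reuse, 2 = maximal reuse.
    The thresholds are illustrative placeholders, not values from the paper.
    """
    # Assumed rule: treat the final part of the episode as goal-sensitive
    # and spend full computation there.
    if obs.progress >= goal_window:
        return 0, 0

    # Each similarity threshold cleared permits one more level of reuse (0..4).
    vision_llm_mode = sum(obs.rep_similarity >= t for t in sim_thresholds)

    # Steadier motion permits more reuse of intermediate denoising states.
    if obs.motion_change < 0.05:
        denoise_mode = 2
    elif obs.motion_change < 0.20:
        denoise_mode = 1
    else:
        denoise_mode = 0
    return vision_llm_mode, denoise_mode


if __name__ == "__main__":
    steady = SchedulerInput(rep_similarity=0.99, motion_change=0.01, progress=0.30)
    near_goal = SchedulerInput(rep_similarity=0.99, motion_change=0.01, progress=0.95)
    print(select_modes(steady))
    print(select_modes(near_goal))
```

In this sketch a steady mid-episode step gets the most reuse in both components, while the same observation late in the episode falls back to full computation. A learned scheduler would replace the fixed thresholds with a trained policy over the same inputs.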
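Abstract (reference translation): Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, treating different control steps with largely fixed computation and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, the authors argue that VLA models should learn when to invest full computation and when to reuse prior computation. They propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress, and jointly allocates computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, ranging from full recomputation to multi-step temporal reuse, based on the stability of the visual-language representation. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA provides a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.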
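As a quick check on the reported control frequencies, the short calculation below converts 13.8 Hz and 26.3 Hz into per-step periods and computes their ratio. Only the two frequencies come from the abstract; the helper name `period_ms` is an assumption for illustration.

```python
def period_ms(frequency_hz):
    """Time per control step in milliseconds for a given control frequency."""
    return 1000.0 / frequency_hz


base_hz = 13.8  # reported baseline control frequency
fast_hz = 26.3  # reported control frequency with ElegantVLA

print(round(period_ms(base_hz), 1))  # 72.5 ms per step
print(round(period_ms(fast_hz), 1))  # 38.0 ms per step
print(round(fast_hz / base_hz, 2))   # 1.91x higher control frequency
```

The control-frequency gain of roughly 1.91x is somewhat smaller than the 2.18x compute reduction quoted for the same tasks. The abstract does not explain the difference.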

Related papers