Fugu-MT Paper Translation (Abstract): From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Paper overview: From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2603.19131v1
Date: Thu, 19 Mar 2026 16:49:28 GMT
Status: translation complete
Last updated in system: 2026-03-20 17:19:06.278116
Title: From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models
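Authors: Zhuofan Li, Hongkun Yang, Zhenyang Chen, Yangxuan Chen, Yingyan Lin, Chaojian Li
Abstract summary: Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks. The prevailing notion of "efficiency" in current VLA research does not reflect actual performance on robotic platforms.
Reference score (attention score computed independently by this site): 5.744219633980964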
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, despite maintaining task success rates. (2) System-level embodied efficiency metrics reveal performance differences in the learned action policies that remain hidden under conventional evaluations. (3) Common adaptation methods such as in-context prompting or supervised fine-tuning show only mild and metric-specific improvements in embodied efficiency. While these methods can reduce targeted embodied-efficiency metrics such as jerk or action rate, the resulting gains may come with trade-offs in other metrics, such as longer completion time. Taken together, our results suggest that conventional inference efficiency metrics can overlook important aspects of embodied execution. Incorporating embodied efficiency provides a more complete view of policy behavior and practical performance, enabling fairer and more comprehensive comparisons of VLA models.
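
The abstract names several embodied-efficiency metrics (completion time, trajectory smoothness, cumulative joint rotation, motion energy) but does not define them here. The sketch below shows one plausible way such quantities could be estimated from a logged joint-position trajectory. All definitions in it are assumptions made for illustration, not the paper's formulas: the function names `finite_diff` and `embodied_metrics` are hypothetical, jerk is estimated with plain finite differences, and "energy" is only a squared-velocity proxy, not a physical energy.

```python
from typing import Dict, List, Sequence


def finite_diff(traj: Sequence[Sequence[float]], dt: float) -> List[List[float]]:
    """First-order finite difference of a (timesteps x joints) trajectory."""
    return [
        [(b - a) / dt for a, b in zip(prev, curr)]
        for prev, curr in zip(traj, traj[1:])
    ]


def embodied_metrics(joint_pos: Sequence[Sequence[float]], dt: float) -> Dict[str, float]:
    """Illustrative embodied-efficiency metrics (assumed definitions).

    joint_pos: joint angles in radians, one row per control step.
    dt: control period in seconds (assumed constant).
    """
    if len(joint_pos) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")

    vel = finite_diff(joint_pos, dt)   # rad/s
    acc = finite_diff(vel, dt)         # rad/s^2
    jerk = finite_diff(acc, dt)        # rad/s^3

    jerk_values = [j for row in jerk for j in row]
    return {
        # Wall-clock duration of the executed trajectory.
        "completion_time": (len(joint_pos) - 1) * dt,
        # Total absolute angle travelled, summed over all joints.
        "cumulative_rotation": sum(abs(v) * dt for row in vel for v in row),
        # Smoothness proxy: lower mean squared jerk means smoother motion.
        "mean_squared_jerk": sum(j * j for j in jerk_values) / len(jerk_values),
        # Crude effort proxy: time integral of squared joint velocity.
        "energy_proxy": sum(v * v for row in vel for v in row) * dt,
    }


# Two single-joint trajectories with the same duration and total rotation.
smooth = [[0.0], [1.0], [2.0], [3.0], [4.0]]
zigzag = [[0.0], [1.0], [0.0], [1.0], [0.0]]

smooth_metrics = embodied_metrics(smooth, dt=1.0)
zigzag_metrics = embodied_metrics(zigzag, dt=1.0)
```

With these toy inputs, both trajectories have the same completion time and cumulative rotation, yet the zigzag one has a much larger mean squared jerk. This is a small-scale analogue of the abstract's point that execution quality can differ between policies even when coarser measures agree.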
