Fugu-MT Paper Translation (Summary): Reshaping Action Error Distributions for Reliable Vision-Language-Action Models

Paper summary: Reshaping Action Error Distributions for Reliable Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2602.04228v1
Date: Wed, 04 Feb 2026 05:37:09 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.160519
Title: Reshaping Action Error Distributions for Reliable Vision-Language-Action Models
Title (reference translation): Reshaping Action Error Distributions for Reliable Vision-Language-Action Models
Authors: Shuanghao Bai, Dakai Wang, Cheng Chi, Wanqi Zhou, Jing Lyu, Xiaoguang Zhao, Pengwei Wang, Zhongyuan Wang, Lei Xing, Shanghang Zhang, Badong Chen
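Abstract summary: In robotic manipulation, vision-language-action (VLA) models have emerged as a promising paradigm for learning generalizable and scalable robot policies. Focusing on continuous-action VLA models, we move beyond conventional MSE-based regression by reshaping action error distributions during training. We evaluate our approaches across standard, few-shot, and noisy settings on multiple representative VLA architectures.
Reference score (independently computed attention score): 69.38615670891038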
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In robotic manipulation, vision-language-action (VLA) models have emerged as a promising paradigm for learning generalizable and scalable robot policies. Most existing VLA frameworks rely on standard supervised objectives, typically cross-entropy for discrete actions and mean squared error (MSE) for continuous action regression, which impose strong pointwise constraints on individual predictions. In this work, we focus on continuous-action VLA models and move beyond conventional MSE-based regression by reshaping action error distributions during training. Drawing on information-theoretic principles, we introduce Minimum Error Entropy (MEE) into modern VLA architectures and propose a trajectory-level MEE objective, together with two weighted variants, combined with MSE for continuous-action VLA training. We evaluate our approaches across standard, few-shot, and noisy settings on multiple representative VLA architectures, using simulation benchmarks such as LIBERO and SimplerEnv as well as real-world robotic manipulation tasks. Experimental results demonstrate consistent improvements in success rates and robustness across these settings. Under imbalanced data regimes, the gains persist within a well-characterized operating range, while incurring negligible additional training cost and no impact on inference efficiency. We further provide theoretical analyses that explain why MEE-based supervision is effective and characterize its practical range. Project Page: https://cognition2actionlab.github.io/VLA-TMEE.github.io/
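Abstract (reference translation): In robotic manipulation, vision-language-action (VLA) models have emerged as a promising paradigm for learning generalizable and scalable robot policies. Most existing VLA frameworks rely on standard supervised objectives, typically cross-entropy for discrete actions and mean squared error (MSE) for continuous action regression, which impose strong pointwise constraints on individual predictions. This work focuses on continuous-action VLA models and moves beyond conventional MSE-based regression by reshaping action error distributions during training. Building on information-theoretic principles, it introduces Minimum Error Entropy (MEE) into modern VLA architectures and proposes a trajectory-level MEE objective, together with two weighted variants, combined with MSE for continuous-action VLA training. The approaches are evaluated across standard, few-shot, and noisy settings on multiple representative VLA architectures, using simulation benchmarks such as LIBERO and SimplerEnv as well as real-world robotic manipulation tasks. Experimental results show consistent improvements in success rates and robustness across these settings. Under imbalanced data regimes, the gains persist within a well-characterized operating range, while incurring negligible additional training cost and no impact on inference efficiency. Theoretical analyses further explain why MEE-based supervision is effective and characterize its practical range. Project page: https://cognition2actionlab.github.io/VLA-TMEE.github.io/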
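The abstract describes pairing an MEE term with MSE but does not give the formulation. As a rough illustration only, the sketch below uses the textbook Parzen-window estimate of Renyi's quadratic entropy over the per-step errors of one action trajectory and adds it to MSE. The function names, the kernel width `sigma`, the mixing weight `lam`, and the `(T, D)` error layout are all assumptions made for this example; they are not taken from the paper, and the paper's two weighted variants are not shown.

```python
import numpy as np


def information_potential(errors, sigma=1.0):
    """Parzen-window estimate of the quadratic information potential.

    errors: array of shape (T, D), one D-dimensional action error per
    time step of a trajectory (this layout is an assumption).
    """
    # Pairwise differences between all error vectors: shape (T, T, D).
    diffs = errors[:, None, :] - errors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)  # shape (T, T)
    # Gaussian kernel of width sigma * sqrt(2), left unnormalised. The
    # normalising constant only adds a constant to the entropy, so it
    # does not change which predictions minimise it.
    kernel = np.exp(-sq_dists / (4.0 * sigma ** 2))
    return kernel.mean()


def mee_loss(pred, target, sigma=1.0):
    """Renyi quadratic entropy of the errors, up to an additive constant."""
    errors = target - pred
    return -np.log(information_potential(errors, sigma))


def combined_loss(pred, target, sigma=1.0, lam=0.1):
    """MSE plus a weighted error-entropy term (lam is a hypothetical weight)."""
    mse = np.mean((target - pred) ** 2)
    return mse + lam * mee_loss(pred, target, sigma)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(8, 7))  # e.g. 8 steps of a 7-DoF action

    # Every step is off by the same constant offset: the errors are
    # identical, so the entropy term vanishes and only MSE sees the bias.
    biased = target - 1.0
    print("constant offset:", mee_loss(biased, target), combined_loss(biased, target))

    # Scattered errors: the entropy term is now positive.
    scattered = target + rng.normal(scale=0.5, size=target.shape)
    print("scattered errors:", mee_loss(scattered, target), combined_loss(scattered, target))
```

The constant-offset case shows a known property of error-entropy criteria: they depend only on differences between errors and are therefore blind to a shared bias. Under the assumptions above, this gives one reason to keep a pointwise term such as MSE in the objective.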
