Fugu-MT Paper Translation (Abstract): StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

Paper abstract: StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

arxiv url: http://arxiv.org/abs/2605.18287v1
Date: Mon, 18 May 2026 12:15:16 GMT
Status: Translation complete
System update date: 2026-05-19 17:57:49.597112
Title: StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
Title (reference translation): StableVLA: Toward Robust Vision-Language-Action Models without Extra Data
Authors: Yiyang Fu, Chubin Zhang, Shukai Gong, Yufan Deng, Kaiwei Sun, Qiyang Min, Qibin Hou, Yansong Tang, Jianan Wang, Daquan Zhou
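Abstract summary: It is infeasible to cover every possible disturbance in a training dataset. This raises a critical question about the robustness of Vision-Language-Action (VLA) models when they encounter unseen real-world visual disturbances. This work presents a systematic study based on recent state-of-the-art VLA models and shows a significant performance drop when visual disturbances absent from the training data are introduced.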
Reference score (independently computed attention measure): 68.81275738717765
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.
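Abstract (reference translation): It is infeasible to cover every possible disturbance in a training dataset. This raises a critical question about the robustness of Vision-Language-Action (VLA) models when they encounter unseen real-world visual disturbances, particularly under imperfect visual conditions. This work presents a systematic study based on recent state-of-the-art VLA models and shows a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this problem, the authors propose a lightweight adapter module grounded in information theory, called the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by 30% on average while adding fewer than 10M parameters, showing notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), the approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.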
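Note on the parameter figures: the abstract quotes a budget of fewer than 10M added parameters and a 14x backbone size gap (0.5B versus 7B). The sketch below only checks that arithmetic. It assumes a generic bottleneck adapter (a down-projection followed by an up-projection), which is a common adapter layout and not necessarily the IB-Adapter's actual design, since the abstract does not describe its internals. The function name and the sizes `D_MODEL`, `D_BOTTLENECK`, and `N_ADAPTERS` are hypothetical values chosen for illustration.

```python
def bottleneck_adapter_params(d_model: int, d_bottleneck: int, bias: bool = True) -> int:
    """Parameter count of a generic down-project / up-project adapter.

    Assumption: two linear layers, d_model -> d_bottleneck -> d_model.
    This is not taken from the paper.
    """
    down = d_model * d_bottleneck + (d_bottleneck if bias else 0)
    up = d_bottleneck * d_model + (d_model if bias else 0)
    return down + up


# Hypothetical sizes, for illustration only (not reported in the abstract).
D_MODEL = 1024       # assumed hidden width of the visual features
D_BOTTLENECK = 256   # assumed bottleneck width
N_ADAPTERS = 12      # assumed number of adapter modules

total = N_ADAPTERS * bottleneck_adapter_params(D_MODEL, D_BOTTLENECK)

# Backbone sizes quoted in the abstract: 7B-scale baselines vs. a 0.5B backbone.
backbone_ratio = 7e9 / 0.5e9

print(f"adapter parameters: {total:,}")
print(f"under 10M budget: {total < 10_000_000}")
print(f"backbone size ratio: {backbone_ratio:.0f}x")
```

Under these assumed sizes the adapter total stays below the 10M budget, and the backbone ratio matches the 14x figure in the abstract. Different widths or module counts would change the total, so the numbers here say nothing about the configuration the authors actually used.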

Related papers