Fugu-MT Paper Translation (Abstract): Vision-Language-Action Model, Robustness, Multi-modal Learning, Robot Manipulation

Paper overview: Vision-Language-Action Model, Robustness, Multi-modal Learning, Robot Manipulation

arxiv url: http://arxiv.org/abs/2604.10055v1
Date: Sat, 11 Apr 2026 06:37:47 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:15.814916
Title: Vision-Language-Action Model, Robustness, Multi-modal Learning, Robot Manipulation
Title (reference translation): Vision-Language-Action Model, Robustness, Multi-modal Learning, Robot Manipulation
Authors: Yuhan Xie, Yuping Yan, Yunqi Zhao, Handing Wang, Yaochu Jin
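Abstract summary: This paper proposes STRONG-VLA, a decoupled fine-tuning framework for Vision-Language-Action (VLA) models. In Stage I, the model is exposed to a curriculum of multimodal perturbations of increasing difficulty. In Stage II, the model is re-aligned with clean task distributions to recover execution fidelity while preserving robustness. Experiments on the LIBERO benchmark show that STRONG-VLA consistently improves task success rates across multiple VLA architectures.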
Reference score (independently computed attention score): 26.063335767640083
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite their strong performance in embodied tasks, recent Vision-Language-Action (VLA) models remain highly fragile under multimodal perturbations, where visual corruption and linguistic noise jointly induce distribution shifts that degrade task-level execution. Existing robustness approaches typically rely on joint training with perturbed data, treating robustness as a static objective, which leads to conflicting optimization between robustness and task fidelity. In this work, we propose STRONG-VLA, a decoupled fine-tuning framework that explicitly separates robustness acquisition from task-aligned refinement. In Stage I, the model is exposed to a curriculum of multimodal perturbations with increasing difficulty, enabling progressive robustness learning under controlled distribution shifts. In Stage II, the model is re-aligned with clean task distributions to recover execution fidelity while preserving robustness. We further establish a comprehensive benchmark with 28 perturbation types spanning both textual and visual modalities, grounded in realistic sources of sensor noise, occlusion, and instruction corruption. Extensive experiments on the LIBERO benchmark show that STRONG-VLA consistently improves task success rates across multiple VLA architectures. On OpenVLA, our method achieves gains of up to 12.60% under seen perturbations and 7.77% under unseen perturbations. Notably, similar or larger improvements are observed on OpenVLA-OFT (+14.48% / +13.81%) and pi0 (+16.49% / +5.58%), demonstrating strong cross-architecture generalization. Real-world experiments on an AIRBOT robotic platform further validate its practical effectiveness. These results highlight the importance of decoupled optimization for multimodal robustness and establish STRONG-VLA as a simple yet principled framework for robust embodied control.
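Abstract (reference translation): Recent Vision-Language-Action (VLA) models remain fragile under multimodal perturbations, where visual corruption and linguistic noise jointly induce distribution shifts that degrade task-level execution. Existing robustness approaches typically rely on joint training with perturbed data and treat robustness as a static objective. This paper proposes STRONG-VLA, a decoupled fine-tuning framework that explicitly separates robustness acquisition from task-aligned refinement. In Stage I, a curriculum of multimodal perturbations of increasing difficulty enables progressive robustness learning under controlled distribution shifts. In Stage II, the model is re-aligned with clean task distributions to recover execution fidelity while preserving robustness. The work also establishes a comprehensive benchmark of 28 perturbation types spanning the textual and visual modalities. Extensive experiments on the LIBERO benchmark show that STRONG-VLA consistently improves task success rates across multiple VLA architectures. On OpenVLA, it achieves gains of up to 12.60% under seen perturbations and 7.77% under unseen perturbations. Similar or larger improvements are observed on OpenVLA-OFT (+14.48% / +13.81%) and pi0 (+16.49% / +5.58%), indicating strong cross-architecture generalization. Real-world experiments on an AIRBOT robotic platform further validate its practical effectiveness. These results highlight the importance of decoupled optimization for multimodal robustness and establish STRONG-VLA as a simple yet principled framework for robust embodied control.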
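
The two-stage idea described in the abstract (a perturbation curriculum of increasing difficulty, then re-alignment on clean data) can be illustrated with a toy schedule. This is only a minimal sketch under stated assumptions, not the authors' implementation: the linear severity ramp, the Gaussian pixel noise, the character-dropping instruction corruption, and the names `perturb`, `curriculum_severity`, and `two_stage_finetune` are all hypothetical, and `train_step` stands in for whatever update a real VLA fine-tuning loop would perform.

```python
import random
from typing import Callable, List, Tuple

# A toy sample: (image as a flat list of pixel values, language instruction).
Sample = Tuple[List[float], str]


def perturb(sample: Sample, severity: float, rng: random.Random) -> Sample:
    """Apply toy visual and textual perturbations scaled by severity in [0, 1].

    Both perturbations are illustrative assumptions, not the paper's 28 types.
    """
    image, text = sample
    # Visual corruption: additive Gaussian noise with std equal to severity.
    noisy_image = [px + rng.gauss(0.0, severity) for px in image]
    # Instruction corruption: drop each character with probability 0.2 * severity.
    noisy_text = "".join(ch for ch in text if rng.random() >= 0.2 * severity)
    return noisy_image, noisy_text


def curriculum_severity(step: int, total_steps: int, max_severity: float = 1.0) -> float:
    """Linearly ramp severity from 0 to max_severity across Stage I (assumed schedule)."""
    if total_steps <= 1:
        return max_severity
    return max_severity * step / (total_steps - 1)


def two_stage_finetune(
    data: List[Sample],
    train_step: Callable[[Sample], None],
    stage1_steps: int,
    stage2_steps: int,
    seed: int = 0,
) -> List[float]:
    """Run Stage I (perturbed, increasing severity) then Stage II (clean data).

    Returns the severity used at every step, so the schedule can be inspected.
    """
    rng = random.Random(seed)
    severities: List[float] = []
    # Stage I: robustness acquisition under progressively harder perturbations.
    for step in range(stage1_steps):
        severity = curriculum_severity(step, stage1_steps)
        train_step(perturb(data[step % len(data)], severity, rng))
        severities.append(severity)
    # Stage II: task-aligned refinement on the clean distribution only.
    for step in range(stage2_steps):
        train_step(data[step % len(data)])
        severities.append(0.0)
    return severities
```

Passing `seen.append` as `train_step` with a single sample records exactly what each stage would feed the model, which makes the schedule easy to inspect before wiring in a real training update.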

Related papers