Fugu-MT Paper Translation (Abstract): Adaptive Capacity Allocation for Vision Language Action Fine-tuning

Paper summary: Adaptive Capacity Allocation for Vision Language Action Fine-tuning

arxiv url: http://arxiv.org/abs/2603.07404v1
Date: Sun, 08 Mar 2026 01:33:01 GMT
Status: translation complete
Last updated in system: 2026-03-10 15:13:14.407935
Title: Adaptive Capacity Allocation for Vision Language Action Fine-tuning
Title (reference translation): Adaptive Capacity Allocation for Vision Language Action Fine-tuning
Authors: Donghoon Kim, Minji Bae, Unghui Nam, Gyeonghun Kim, Suyun Lee, Kyuhong Shim, Byonghyo Shim
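Abstract summary: Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments still requires adaptation. We present LoRA-SP, a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters.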
Reference score (independently computed attention score): 30.782665306687992
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (e.g., $r \in \{4, 8\}$), while spectral analyses indicate VLAs may require much larger ranks (e.g., $r \approx 128$) or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select-Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores $E(k) \ge η$, providing a direct link to approximation error via our spectral analysis. During training, $η$ concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones ($π_0$ and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
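Abstract (reference translation): Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, but the exposed capacity knob, the rank, does not transfer uniformly. Small ranks suffice for LLMs (e.g., $r \in \{4, 8\}$), while spectral analyses indicate that VLAs may require much larger ranks (e.g., $r \approx 128$). We present LoRA-SP (Select-Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores, $E(k) \ge η$, which links directly to approximation error through spectral analysis. During training, $η$ concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks on an unseen AgileX PiPER arm, across two VLA backbones ($π_0$ and SmolVLA), LoRA-SP matches full fine-tuning with far fewer trainable parameters.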
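
The energy-target selection described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not define $E(k)$ precisely, so the sketch assumes $E(k)$ is the sum of the $k$ largest squared scores divided by the total squared score. The function names `select_active_set` and `adapter_output`, the matrices `U` and `V` standing in for the shared vector bank, and the fixed `scores` array (in place of a learned, input-dependent router) are all hypothetical.

```python
import numpy as np


def select_active_set(scores, eta):
    """Return indices of the smallest set of vectors whose cumulative
    squared score reaches a fraction `eta` of the total energy.

    Assumption: E(k) = (sum of the k largest squared scores) / (total).
    """
    scores = np.asarray(scores, dtype=float)
    if np.any(scores < 0):
        raise ValueError("scores must be nonnegative")
    energy = scores ** 2
    total = energy.sum()
    if total == 0.0:
        return np.array([], dtype=int)
    # Sort directions by energy, largest first.
    order = np.argsort(-energy, kind="stable")
    cumulative = np.cumsum(energy[order]) / total
    # Smallest k with E(k) >= eta; clamp in case of rounding at eta = 1.
    k = int(np.searchsorted(cumulative, eta, side="left")) + 1
    k = min(k, len(scores))
    return order[:k]


def adapter_output(x, U, V, scores, eta):
    """SVD-style update U diag(s) V^T x, restricted to the active set.

    Hypothetical shapes: U is (d_out, r), V is (d_in, r), x is (d_in,).
    """
    scores = np.asarray(scores, dtype=float)
    idx = select_active_set(scores, eta)
    return U[:, idx] @ (scores[idx] * (V[:, idx].T @ x))


# Usage: four candidate directions, keep enough to cover 90% of the energy.
scores = np.array([3.0, 0.0, 1.0, 2.0])
active = select_active_set(scores, eta=0.9)
```

With these example scores the squared energies are 9, 0, 1, and 4, so under the stated assumption two directions already cover 13/14 of the total and the other two are pruned for this input. A higher `eta` keeps more directions; a lower one keeps fewer.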

Related papers