Fugu-MT Paper Translation (Summary): VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models

Paper summary: VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models

arxiv url: http://arxiv.org/abs/2604.03956v1
Date: Sun, 05 Apr 2026 04:23:18 GMT
Status: Translation complete
System update date: 2026-04-07 15:49:18.848828
Title: VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models
Authors: Ravi Ranjan, Agoritsa Polyzou
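Abstract summary: In OpenVLA-style policies, behavior is produced through a fused visual encoder, a cross-modal projector, and a language backbone that predicts tokenized robot actions. VLA-Forget is a hybrid unlearning framework that combines ratio-aware selective editing for perception with layer-selective reasoning/action unlearning.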
Reference score (independently computed attention score): 0.10742675209112619
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language-action (VLA) models are emerging as embodied foundation models for robotic manipulation, but their deployment introduces a new unlearning challenge: removing unsafe, spurious, or privacy-sensitive behaviors without degrading perception, language grounding, and action control. In OpenVLA-style policies, behavior is produced through a fused visual encoder, a cross-modal projector, and a language backbone that predicts tokenized robot actions, so undesirable knowledge can be distributed across perception, alignment, and reasoning/action layers rather than confined to a single module. Consequently, partial unlearning applied only to the vision stack or only to the language backbone is often insufficient, while conventional unlearning baselines designed for standalone vision or language models may leave residual forgetting or incur unnecessary utility loss in embodied settings. We propose VLA-Forget, a hybrid unlearning framework that combines ratio-aware selective editing for perception and cross-modal specificity with layer-selective reasoning/action unlearning for utility-preserving forgetting. VLA-Forget jointly optimizes three objectives: targeted forgetting, perceptual preservation, and reasoning retention, through staged updates over the visual encoder, projector, and upper action-generating transformer blocks. Across forget-set behavior probes and retain-task evaluations, VLA-Forget improves forgetting efficacy by 10%, preserves perceptual specificity by 22%, retains reasoning and task success by 9%, and reduces post-quantization recovery by 55% relative to strong unlearning baselines.
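The abstract describes staged updates over the visual encoder, the projector, and the upper action-generating transformer blocks, driven by three jointly optimized objectives. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The parameter-name prefixes (`vision_encoder.`, `projector.`, `language_model.layers.N.`), the stage names, the `top_k` cutoff, and the plain weighted sum are all assumptions made for illustration; the abstract does not specify any of them.

```python
import re
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Hypothetical weights; the abstract gives no values.
    forget: float = 1.0
    perception: float = 1.0
    reasoning: float = 1.0


def combined_objective(forget_term, perception_term, reasoning_term, weights):
    """Weighted sum of three scalar terms, each assumed to be minimized.

    The terms stand in for targeted forgetting, perceptual preservation,
    and reasoning retention. How each term is computed is left open here.
    """
    return (weights.forget * forget_term
            + weights.perception * perception_term
            + weights.reasoning * reasoning_term)


def select_trainable(param_names, stage, num_blocks, top_k):
    """Return the parameter names that would be updated in a given stage.

    stage == "perception": visual encoder and cross-modal projector.
    stage == "reasoning":  only the top_k uppermost transformer blocks.
    All name prefixes are assumed, not taken from any real checkpoint.
    """
    if stage not in ("perception", "reasoning"):
        raise ValueError(f"unknown stage: {stage!r}")

    selected = []
    for name in param_names:
        if stage == "perception":
            if name.startswith(("vision_encoder.", "projector.")):
                selected.append(name)
        else:
            match = re.match(r"language_model\.layers\.(\d+)\.", name)
            # Keep a block only if its index is among the last top_k.
            if match and int(match.group(1)) >= num_blocks - top_k:
                selected.append(name)
    return selected


if __name__ == "__main__":
    names = [
        "vision_encoder.blocks.0.weight",
        "projector.fc.weight",
        "language_model.layers.0.mlp.weight",
        "language_model.layers.3.mlp.weight",
    ]
    print(select_trainable(names, "perception", num_blocks=4, top_k=1))
    print(select_trainable(names, "reasoning", num_blocks=4, top_k=1))
```

In a real training loop, the names returned for each stage would typically decide which parameters receive gradients, so everything outside the selected modules stays frozen during that stage.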


Related papers