Fugu-MT Paper Translation (Abstract): Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Paper overview: Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

arxiv url: http://arxiv.org/abs/2603.06001v1
Date: Fri, 06 Mar 2026 08:01:36 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.297412
Title: Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration
Title (reference translation): Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration
Authors: Ninghao Zhang, Bin Zhu, Shijie Zhou, Jingjing Chen
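Abstract summary: Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions. We reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We propose Instruction-Guided Attention Recalibration (IGAR).
Reference score (independently computed attention score): 24.562540060971273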
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We refer to this phenomenon as linguistic blindness, where VLA policies prioritize visual priors over instruction semantics during action generation. To systematically analyze this issue, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset that probes language-action coupling by injecting controlled OOD instruction contradictions while keeping the visual environment unchanged. Evaluations on three representative VLA architectures, including Pi0, Pi0.5 and OpenVLA OFT, show that these models frequently succeed at tasks despite logically impossible instructions, revealing a strong visual bias in action generation. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be directly applied to existing VLA models. Experiments across 30 LIBERO tasks demonstrate that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We additionally validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.
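Abstract (reference translation): Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We call this phenomenon linguistic blindness: VLA policies prioritize visual priors over instruction semantics during action generation. To analyze this issue systematically, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset. Evaluations on three representative VLA architectures, including Pi0, Pi0.5 and OpenVLA OFT, show that these models frequently succeed at tasks despite logically impossible instructions. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be applied directly to existing VLA models. Experiments across 30 LIBERO tasks show that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We also validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.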
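
As a rough illustration of what rebalancing an attention distribution toward instruction tokens can look like, the following Python sketch upweights the instruction tokens in a single attention row and renormalizes. It is not the IGAR algorithm: the abstract does not specify the recalibration rule, so the function name `recalibrate_attention`, the `boost` factor, and the multiplicative update are all assumptions made for illustration.

```python
from typing import List, Sequence


def recalibrate_attention(
    weights: Sequence[float],
    is_instruction: Sequence[bool],
    boost: float = 2.0,
) -> List[float]:
    """Upweight instruction tokens in one attention row, then renormalize.

    Hypothetical illustration only: `boost` and the multiplicative rule are
    assumptions, not the recalibration rule used by IGAR.
    """
    if len(weights) != len(is_instruction):
        raise ValueError("weights and is_instruction must have equal length")
    if boost <= 0:
        raise ValueError("boost must be positive")
    # Scale only the entries that correspond to instruction tokens.
    scaled = [w * boost if flag else w for w, flag in zip(weights, is_instruction)]
    total = sum(scaled)
    if total == 0:
        # Nothing to renormalize; return the row unchanged.
        return list(weights)
    # Renormalize so the row is a probability distribution again.
    return [s / total for s in scaled]


def instruction_mass(weights: Sequence[float], is_instruction: Sequence[bool]) -> float:
    """Total attention assigned to instruction tokens."""
    return sum(w for w, flag in zip(weights, is_instruction) if flag)


# Toy attention row: three visual tokens followed by two instruction tokens.
row = [0.4, 0.3, 0.2, 0.06, 0.04]
mask = [False, False, False, True, True]
new_row = recalibrate_attention(row, mask, boost=4.0)
```

With these toy numbers, the instruction tokens' share of attention rises from 0.10 to about 0.31 while the row still sums to 1. Because the adjustment touches only the attention weights at inference time, a rule of this general kind requires no retraining, which is the property the abstract emphasizes.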

Related papers

When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs [31.92520697946991]
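Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, but in practice they often fail to follow language faithfully. Counterfactual failures are shown to have gone largely unexamined in state-of-the-art VLAs. We propose CAG, a simple two-branch inference scheme.
Paper reference translation (metadata) (2026-02-19T18:59:20Z)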
ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
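We propose ReViP, a new VLA framework with Vision-Proprioception Rebalance that strengthens visual grounding and robustness under perturbations. Specifically, it uses an external VLM as a task-stage observer to extract real-time, task-centric visual cues from visual observations. We also propose the first False-Completion Benchmark Suite, built on LIBERO with controlled settings such as object drop.
Paper reference translation (metadata) (2026-01-23T11:31:07Z)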
LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries [30.732526921367835]
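LangForce is a new framework that enforces instruction following through Bayesian decomposition. We show that LangForce significantly improves generalization without requiring new data.
Paper reference translation (metadata) (2026-01-21T17:15:22Z)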
Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy [59.44168425139687]
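BayesVLA is a Bayesian factorization that decomposes the policy into a visual-action prior, which supports seeing-to-act, and a language-conditioned likelihood, which enables prompt-to-specify. Experiments show superior generalization to unseen instructions, objects, and environments compared with existing methods.
Paper reference translation (metadata) (2025-12-12T01:59:23Z)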
AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models [60.39655329875822]
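Vision-Language-Action (VLA) models enable robots to interpret natural language instructions and perform diverse tasks. Interest in attacking such models is growing, but the effectiveness of existing methods remains unclear. We propose AttackVLA, a unified framework aligned with the VLA development lifecycle.
Paper reference translation (metadata) (2025-11-15T10:30:46Z)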
Learning Affordances at Inference-Time for Vision-Language-Action Models [50.93181349331096]
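In robotics, Vision-Language-Action models (VLAs) offer a promising path toward solving complex control tasks. We introduce LITEN (Learning from Inference-Time Execution), which connects a low-level VLA policy to a high-level VLM conditioned on past experience. The approach iterates between a reasoning phase, which generates and executes plans for the low-level VLA, and an assessment phase, which reflects on the outcome.
Paper reference translation (metadata) (2025-10-22T16:43:29Z)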
Do What? Teaching Vision-Language-Action Models to Reject the Impossible [53.40183895299108]
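Vision-Language-Action (VLA) models have shown strong performance across a variety of robotic tasks. We propose Instruct-Verify-and-Act (IVA). Experiments show that IVA improves false-premise detection accuracy by 97.56% over baselines.
Paper reference translation (metadata) (2025-08-22T10:54:33Z)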
MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
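We investigate the root cause of VLMs' biased predictions in the open-vocabulary detection context. Our observations lead to a simple yet effective paradigm, coded MarvelOVD, that generates significantly better training targets. Our method outperforms other state-of-the-art methods by significant margins.
Paper reference translation (metadata) (2024-07-31T09:23:57Z)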

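The related papers list is generated automatically from the titles and abstracts of papers on this site.

This page shows information about the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).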