Fugu-MT Paper Translation (Abstract): Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning

Paper summary: Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning

arxiv url: http://arxiv.org/abs/2605.29577v1
Date: Thu, 28 May 2026 08:22:49 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.063011
Title: Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning
Authors: Kyujin Lee, Injae Kim, Jihwan Park, Yejun Ju, Minseok Joo, Hyunwoo J. Kim
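Abstract summary: This paper introduces inverse dynamics learning as an auxiliary objective that directly supervises the vision encoder of Vision-Language-Action (VLA) models. By predicting the action between current and future observations, the objective encourages the encoder to capture the fine-grained visual distinctions that determine low-level actions. Experiments on CALVIN ABC-D and SimplerEnv show consistent gains across diverse VLA baselines.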
Reference score (independently computed attention metric): 32.59795755994378
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM-derived representations are often insensitive to subtle visual distinctions required for low-level control, causing state aliasing between visually similar states that require substantially different actions. Prior VLA studies improve visual understanding by generating visual or reasoning outputs, such as future frames, 2D grounding points or traces, or intermediate spatial reasoning steps, but these objectives typically shape the vision encoder only indirectly through end-to-end prediction and do not explicitly analyze state aliasing in the learned visual feature space. To mitigate state aliasing, we introduce inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder. By predicting the action between current and future observations, our objective encourages the encoder to capture fine-grained visual distinctions that determine low-level actions. We further use pseudo-reversed supervision to expose the encoder to a broader range of action directions and improve generalization under limited robot demonstrations. Our method applies to diverse VLA baselines, uses only standard observation-action pairs without additional annotations, and preserves the original inference pipeline at test time. Experiments on CALVIN ABC-D and SimplerEnv show consistent gains across diverse VLA baselines. Frozen-encoder probing and state-feature alignment analyses further show that our method learns state-discriminative visual representations that reduce state aliasing and better align with robot state changes.
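The abstract describes the auxiliary objective only at a high level, so the following is a minimal sketch of how inverse-dynamics training targets might be built from standard observation-action sequences, not the authors' implementation. Several things here are assumptions: the function names `inverse_dynamics_pairs` and `inverse_dynamics_loss` are hypothetical, observations are represented as plain feature vectors standing in for vision-encoder outputs, actions are treated as additive delta actions, and pseudo-reversed supervision is modeled as swapping the observation pair and negating the action. The abstract does not state how the reversed targets are actually constructed.

```python
def inverse_dynamics_pairs(observations, actions, horizon=1, add_reversed=True):
    """Build (current, future, target_action) triples for an inverse dynamics head.

    observations: list of feature vectors (stand-ins for encoder outputs).
    actions: list of delta-action vectors; actions[t] is assumed to move
             observations[t] to observations[t + 1].
    """
    dim = len(actions[0])
    triples = []
    for t in range(len(observations) - horizon):
        # Assumption: delta actions are additive, so the action spanning
        # `horizon` steps is the sum of the per-step actions.
        net = [sum(a[d] for a in actions[t:t + horizon]) for d in range(dim)]
        triples.append((observations[t], observations[t + horizon], net))
        if add_reversed:
            # Assumption: a pseudo-reversed sample swaps the two observations
            # and negates the action; this only makes sense for delta actions.
            triples.append((observations[t + horizon], observations[t],
                            [-x for x in net]))
    return triples


def inverse_dynamics_loss(head, triples):
    """Mean squared error between predicted and target actions.

    head: any callable (current, future) -> predicted action vector.
    """
    total = 0.0
    for current, future, target in triples:
        predicted = head(current, future)
        total += sum((p - y) ** 2 for p, y in zip(predicted, target)) / len(target)
    return total / len(triples)


# Toy data: the "features" are 2D positions, so an ideal head is a subtraction.
toy_observations = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [0.5, 2.0]]
toy_actions = [[1.0, 0.0], [0.0, 2.0], [-0.5, 0.0]]


def ideal_head(current, future):
    return [f - c for c, f in zip(current, future)]


toy_triples = inverse_dynamics_pairs(toy_observations, toy_actions)
toy_loss = inverse_dynamics_loss(ideal_head, toy_triples)
```

In a real training setup this loss would presumably be added, with some weight, to the policy's action-prediction loss and backpropagated into the vision encoder. Because the head is used only during training, the inference pipeline stays unchanged, which matches the abstract's claim about test time.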

Related papers