Fugu-MT Paper Translation (Abstract): Contrastive Representation Regularization for Vision-Language-Action Models

Paper overview: Contrastive Representation Regularization for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2510.01711v1
Date: Thu, 02 Oct 2025 06:41:22 GMT
Status: Translation complete
Last updated in system: 2025-10-03 14:32:17.253721
Title: Contrastive Representation Regularization for Vision-Language-Action Models
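Title (reference translation): Contrastive Representation Regularization for Vision-Language-Action Models
Authors: Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin
Abstract summary: This paper introduces Robot State-aware Contrastive Loss (RS-CL), a representation regularization for Vision-Language-Action (VLA) models. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Experiments show that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models.
Reference score (independently computed attention score): 64.10170453130324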
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models have shown their capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive states. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL effectively enhances control-relevant representation learning while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models: it pushes the prior art from 30.8% to 41.5% on pick-and-place tasks in RoboCasa-Kitchen, through more accurate positioning during grasping and placing, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
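Abstract (reference translation): Vision-Language-Action (VLA) models have shown their capabilities in robot manipulation by leveraging the rich representations of pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal and are not sensitive to robotic signals such as control actions and proprioceptive states. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models that bridges the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL effectively enhances control-relevant representation learning while remaining lightweight and fully compatible with the standard VLA training pipeline. Experiments show that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models, improving from 30.8% to 41.5% on pick-and-place tasks in RoboCasa-Kitchen and from 45.0% to 58.3% on real-robot manipulation tasks.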
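Illustrative sketch (not from the paper): the abstract says only that RS-CL uses relative distances between proprioceptive states as soft supervision for a contrastive objective, and it gives no formula. The Python below is one possible reading under stated assumptions: cosine similarity between embeddings, Euclidean distance between states, soft targets obtained from a softmax over negative state distances, and a cross-entropy between the two distributions. The function names and the `temperature` and `beta` parameters are hypothetical and are not taken from the authors' implementation.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def soft_contrastive_loss(embeddings, states, temperature=0.1, beta=1.0):
    """Soft-label contrastive loss (assumed form, needs at least 2 samples).

    For each anchor i, the predicted distribution over the other samples
    comes from embedding similarity, and the target distribution comes
    from state proximity: closer states receive more target weight.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logits = [cosine(embeddings[i], embeddings[j]) / temperature
                  for j in others]
        log_probs = log_softmax(logits)
        # Soft targets: softmax over negative state distances.
        log_targets = log_softmax(
            [-euclidean(states[i], states[j]) / beta for j in others])
        targets = [math.exp(t) for t in log_targets]
        # Cross-entropy between soft targets and predicted distribution.
        total += -sum(t * lp for t, lp in zip(targets, log_probs))
    return total / n


# Toy data (hypothetical): four 1-D robot states forming two nearby pairs.
states = [[0.0], [0.1], [1.0], [1.1]]
# Embeddings whose geometry matches the state geometry.
aligned = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
# Same embeddings with samples 1 and 2 swapped, so the geometry disagrees.
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]

print(soft_contrastive_loss(aligned, states))
print(soft_contrastive_loss(misaligned, states))
```

In this toy setup the loss is lower when samples with nearby states also have nearby embeddings, which is the behavior a state-aware regularizer of this kind is meant to encourage. In an actual training pipeline such a term would be added to the action prediction objective with some weighting, which the abstract does not specify.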

Related papers