Fugu-MT Paper Translation (Abstract): SCAR: Self-Supervised Continuous Action Representation Learning

Paper abstract: SCAR: Self-Supervised Continuous Action Representation Learning

arxiv url: http://arxiv.org/abs/2605.16412v1
Date: Wed, 13 May 2026 16:23:11 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:46.305433
Title: SCAR: Self-Supervised Continuous Action Representation Learning
Title (reference translation): SCAR: Self-Supervised Continuous Action Representation Learning
Authors: Hongjia Liu, Fan Feng, Minghao Fu, Xinyue Wang, Haofei Lu, Biwei Huang
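Abstract summary: We propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions.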
Reference score (independently computed attention metric): 36.917304453471864
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.
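Abstract (reference translation): Although action plays a central role in embodied intelligence, learning transferable action representations from visual transitions is a fundamental challenge, especially when world models must generalize across embodiments with limited data. We argue that action is not just an auxiliary conditioning signal but a representational factor that separates controllable change from embodiment-specific actuation. This work therefore proposes SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. So that the latent space is transferable rather than a generic visual bottleneck, the latent action posterior is regularized toward a standard Gaussian prior to limit arbitrary visual encoding, and adversarial invariance is introduced to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, improving cross-embodiment low-data adaptation and cross-task transfer. These results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.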
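
The abstract describes an IDM that infers a latent action from a pair of latent observations, an FDM that predicts the next latent conditioned on that action, and a regularizer pulling the latent action posterior toward a standard Gaussian prior. The following is a minimal sketch of how those three pieces could combine into one loss. It is not the authors' implementation: the toy `toy_idm` and `toy_fdm` functions, the fixed log-variance, the weight `beta`, and the squared-error prediction term are all assumptions made for illustration, and the adversarial invariance term is omitted.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def sample_latent_action(mu, logvar, rng):
    """Reparameterized sample: a = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def toy_idm(z_t, z_next):
    """Hypothetical IDM: posterior mean is the latent difference (assumption)."""
    mu = [b - a for a, b in zip(z_t, z_next)]
    logvar = [-2.0] * len(mu)  # fixed log-variance, purely illustrative
    return mu, logvar


def toy_fdm(z_t, action):
    """Hypothetical FDM: predicts the next latent by adding the latent action."""
    return [z + a for z, a in zip(z_t, action)]


def joint_loss(z_t, z_next, beta, rng):
    """Prediction error of the FDM plus beta-weighted KL on the IDM posterior."""
    mu, logvar = toy_idm(z_t, z_next)
    action = sample_latent_action(mu, logvar, rng)
    pred = toy_fdm(z_t, action)
    recon = sum((p - t) ** 2 for p, t in zip(pred, z_next)) / len(z_next)
    kl = kl_to_standard_normal(mu, logvar)
    return recon + beta * kl, recon, kl


rng = random.Random(0)
total, recon, kl = joint_loss([0.0, 1.0], [0.5, 0.5], beta=0.1, rng=rng)
print(f"total={total:.4f} recon={recon:.4f} kl={kl:.4f}")
```

In this sketch the KL term is zero only when the posterior equals the prior, so a larger `beta` limits how much information about the observation pair the latent action can carry, which is the role the abstract assigns to the Gaussian regularizer.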

Related papers