Fugu-MT Paper Translation (Abstract): ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation

Paper overview: ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation

arxiv url: http://arxiv.org/abs/2605.30484v1
Date: Thu, 28 May 2026 19:03:30 GMT
Status: Translation complete
Last updated in system: 2026-06-01 20:56:50.188182
Title: ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation
Title (reference translation): ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation
Authors: Zeyuan He, Bowen Yang, Zhirui Fang, Keru Zhou, Lei Jiang, Jingjing Qian, Fan Mo, Junchi Yan, Philip Torr, Xiu Li, Li Jiang, Jialin Yu
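Abstract summary: Vision-Language-Action (VLA) models have shown promise for robotic manipulation, but most existing policies operate reactively by directly regressing actions from current observations. ELAN4D is an embodiment-centric, 4D-aware training framework that enhances policies with future robot keypoint tracks as predictive spatio-temporal supervision.
Reference score (independently computed attention measure): 63.617951135459016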
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have shown promise for robotic manipulation, yet most existing policies operate reactively by directly regressing actions from current observations, without explicitly modeling future dynamics. This limits their ability to generalize under out-of-distribution perturbations. To address this issue, we propose ELAN4D, an embodiment-centric, 4D-aware training framework that enhances VLA policies with future robot keypoint tracks as predictive spatio-temporal supervision. Using only forward kinematics from proprioceptive states, we derive 3D displacement tracks of robot keypoints, such as joints and the end-effector, with negligible preprocess cost. These tracks provide metric and compact supervision without requiring external trackers or reconstruction. A plug-and-play auxiliary branch with a lightweight track decoder injects this 4D signal into the action expert while preserving the pretrained vision-language backbone through gradient isolation. The track decoder is discarded during inference, leaving the base policy interface unchanged. Extensive experiments on LIBERO, LIBERO-Plus, RoboTwin2.0 and real-world manipulation tasks demonstrate that ELAN4D consistently improves over strong VLA baselines, achieving the best overall performance and substantial gains under out-of-distribution perturbations, including camera, background, and layout shifts. These results highlight the effectiveness of embodiment-centric 4D supervision for building more robust and generalizable manipulation policies.
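Abstract (reference translation): Vision-Language-Action (VLA) models have shown promise for robotic manipulation, but most existing policies operate reactively, regressing actions directly from current observations without explicitly modeling future dynamics. This limits their ability to generalize under out-of-distribution perturbations. To address this issue, we propose ELAN4D, an embodiment-centric, 4D-aware training framework that enhances VLA policies with future robot keypoint tracks as predictive spatio-temporal supervision. Using only forward kinematics from proprioceptive states, 3D displacement tracks of robot keypoints, such as joints and the end-effector, are derived at negligible preprocessing cost. These tracks provide metric and compact supervision without requiring external trackers or reconstruction. A plug-and-play auxiliary branch with a lightweight track decoder injects this 4D signal into the action expert while preserving the pretrained vision-language backbone through gradient isolation. The track decoder is discarded during inference, so the base policy interface is unchanged. Extensive experiments on LIBERO, LIBERO-Plus, RoboTwin2.0, and real-world manipulation tasks show that ELAN4D consistently improves over strong VLA baselines, achieving the best overall performance and substantial gains under out-of-distribution perturbations, including camera, background, and layout shifts. These results highlight the effectiveness of embodiment-centric 4D supervision for building more robust and generalizable manipulation policies.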
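
Illustrative sketch (not from the paper): the abstract states that 3D displacement tracks of robot keypoints are derived using only forward kinematics from proprioceptive states. The Python snippet below shows that general idea on a hypothetical planar serial arm. The function names planar_fk and displacement_tracks, the two-link geometry, and the choice to measure displacements relative to the current step are all assumptions made for illustration; the paper's actual kinematic model, keypoint set, and track parameterization are not given in this record.

```python
import numpy as np


def planar_fk(joint_angles, link_lengths):
    """Forward kinematics for a hypothetical planar serial arm.

    Returns the 3D positions (z = 0) of the base, each subsequent joint,
    and the end-effector, with shape (num_links + 1, 3).
    """
    points = [np.zeros(3)]  # base keypoint at the origin
    angle = 0.0
    for q, length in zip(joint_angles, link_lengths):
        angle += q  # joint angles accumulate along the chain
        step = np.array([length * np.cos(angle), length * np.sin(angle), 0.0])
        points.append(points[-1] + step)
    return np.stack(points)


def displacement_tracks(joint_trajectory, link_lengths, t, horizon):
    """Displacement of every keypoint over the next `horizon` steps,
    measured relative to its position at step `t` (an assumed convention).

    Returns an array of shape (horizon, num_keypoints, 3).
    """
    current = planar_fk(joint_trajectory[t], link_lengths)
    future = np.stack([
        planar_fk(q, link_lengths)
        for q in joint_trajectory[t + 1 : t + 1 + horizon]
    ])
    return future - current


# Toy proprioceptive log: three time steps of two joint angles (radians).
joint_trajectory = np.array([
    [0.0, 0.0],
    [np.pi / 2, 0.0],
    [np.pi / 2, np.pi / 2],
])
tracks = displacement_tracks(joint_trajectory, [1.0, 1.0], t=0, horizon=2)
print(tracks.shape)  # (2, 3, 3): horizon x keypoints x xyz
```

Because the tracks come from joint states alone, a sketch like this needs no camera input or external tracker, which matches the abstract's claim of negligible preprocessing cost. How such tracks are consumed by the auxiliary track decoder is outside the scope of this example.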
