Fugu-MT Paper Translation (Abstract): Cross-Hand Latent Representation for Vision-Language-Action Models

Paper abstract: Cross-Hand Latent Representation for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2603.10158v1
Date: Tue, 10 Mar 2026 18:50:57 GMT
Status: Translation complete
System update date: 2026-03-23 08:17:42.175807
Title: Cross-Hand Latent Representation for Vision-Language-Action Models
Title (reference translation): Cross-Hand Latent Representation for Vision-Language-Action Models
Authors: Guangqi Jiang, Yutong Liang, Jianglong Ye, Jia-Yang Huang, Changwei Jing, Rocky Duan, Pieter Abbeel, Xiaolong Wang, Xueyan Zou
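Abstract summary: Training reliable vision-language-action models for dexterous manipulation requires large-scale demonstrations across many robotic hands. XL-VLA is a vision-language-action framework integrated with a latent action space shared across diverse dexterous hands.
Reference score (independently computed attention score): 49.32460749933983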
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception--vision, sound, and language-guided intent--to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. In addition, as new dexterous embodiments appear rapidly, collecting data for each becomes costly and impractical, creating a need for scalable cross-embodiment learning. We introduce XL-VLA, a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space is directly pluggable into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results demonstrate that XL-VLA consistently outperforms baseline VLA models operating in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.
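Abstract (reference translation): Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily life. Humans rely on rich multimodal perception (vision, sound, and language-guided intent) to perform dexterous actions, which motivates vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. Furthermore, as new dexterous embodiments emerge rapidly, collecting data for each one becomes costly and impractical, creating a need for scalable cross-embodiment learning. XL-VLA is a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space plugs directly into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results show that XL-VLA consistently outperforms baseline VLA models that operate in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.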
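
The abstract does not specify how the shared latent action space is built. The following is only a minimal sketch of the general idea, assuming a hypothetical linear encoder/decoder pair per hand; the names `HandAdapter`, `LATENT_DIM`, `hand_a`, and `hand_b`, as well as the joint counts, are illustrative assumptions and not taken from the paper. It shows one latent action being decoded into joint commands for two hands with different numbers of joints.

```python
import numpy as np

LATENT_DIM = 8  # assumed size of the shared latent action space


class HandAdapter:
    """Hypothetical per-hand adapter; NOT the paper's architecture."""

    def __init__(self, num_joints: int, rng: np.random.Generator) -> None:
        self.num_joints = num_joints
        # Decoder maps a latent action to this hand's joint space.
        # Random weights stand in for a learned mapping.
        self.decoder = rng.standard_normal((num_joints, LATENT_DIM))
        # Encoder is the pseudo-inverse, so encode(decode(z)) recovers z
        # whenever the decoder has full column rank.
        self.encoder = np.linalg.pinv(self.decoder)

    def encode(self, joints: np.ndarray) -> np.ndarray:
        """Map hand-specific joint values into the shared latent space."""
        return self.encoder @ joints

    def decode(self, latent: np.ndarray) -> np.ndarray:
        """Map a shared latent action to hand-specific joint values."""
        return self.decoder @ latent


rng = np.random.default_rng(0)

# Two illustrative hands with different degrees of freedom.
hands = {
    "hand_a": HandAdapter(num_joints=12, rng=rng),
    "hand_b": HandAdapter(num_joints=16, rng=rng),
}

# A policy would predict one latent action regardless of the target hand;
# a random vector stands in for that prediction here.
latent_action = rng.standard_normal(LATENT_DIM)

# The same latent action is decoded into each hand's own joint space.
joint_commands = {name: hand.decode(latent_action) for name, hand in hands.items()}
```

In this sketch the policy output has a fixed size no matter which hand is attached, which is the property the abstract attributes to an embodiment-invariant latent space; a real system would learn the per-hand mappings from data rather than use random linear maps.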

Related papers