Fugu-MT Paper Translation (Summary): X-DiffVLA: X-Embodied Diffusion Action Heads for Vision-Language-Action Models

Paper summary: X-DiffVLA: X-Embodied Diffusion Action Heads for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2605.25044v1
Date: Sun, 24 May 2026 12:41:34 GMT
Status: translation complete
System update date: 2026-05-26 19:50:18.670996
Title: X-DiffVLA: X-Embodied Diffusion Action Heads for Vision-Language-Action Models
Authors: Boyu Li, Chaoyi Xu, Haoqi Yuan, Xinrun Xu, Börje F. Karlsson, Dongbin Zhao, Haoran Li, Zongqing Lu
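Abstract summary: This paper proposes X-DiffVLA, a diffusion-based VLA model. X-DiffVLA leverages the generative strengths of diffusion models to capture both the diversity and the latent correlations in cross-embodied datasets. It achieves improvements of 15.3% and 12.5% on RoboCasa and Isaac Gym, respectively.
Reference score (independently computed attention measure): 39.033717938466246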
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Learning universal policies from cross-embodied data remains a fundamental challenge in robotics. Although Vision-Language-Action (VLA) models are pre-trained on large and diverse datasets, they typically rely on embodiment-specific fine-tuning to achieve strong performance in downstream tasks. This requirement severely limits their generalization capability and restricts knowledge transfer across embodiments performing similar tasks. To overcome these limitations, we focus on cross-embodied settings with shared robotic bases and heterogeneous end-effectors, and propose X-DiffVLA, a diffusion-based VLA model featuring a unified cross-embodied action head. X-DiffVLA can leverage the generative strengths of diffusion models to capture both the diversity and latent correlations in cross-embodied datasets. Specifically, we introduce Embodiment Forcing, a classifier-free guidance technique to implicitly steer action generation toward embodiment-specific functional components, capturing fine-grained structural nuances without explicit supervision. In addition, a Morphological Tree Diffusion approach is designed to strengthen behavioral correlations across diverse end-effectors, maximizing the transferability of heterogeneous demonstrations. Experimental results across RoboCasa and Isaac Gym, covering different embodiments from grippers to dexterous hands, show that X-DiffVLA achieves state-of-the-art performance, with improvements of 15.3% and 12.5%, respectively. Real-world evaluations further validate the robustness of the proposed framework and its effectiveness in scalable cross-embodied policy learning.
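The abstract describes Embodiment Forcing as a classifier-free guidance technique but gives no formula for it. As background only, the following is a minimal sketch of the generic classifier-free guidance combination step used with diffusion models, not the paper's method. The function name `cfg_combine`, its parameters, and the convention that a scale of 1.0 recovers the conditional prediction are assumptions made for illustration.

```python
from typing import List, Sequence


def cfg_combine(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Generic classifier-free guidance (illustrative, not the paper's code).

    Combines an unconditional and a conditional noise prediction as
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    so that scale 0.0 gives the unconditional prediction, 1.0 gives the
    conditional one, and values above 1.0 extrapolate toward the condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    # Hypothetical 2-dimensional noise predictions for one denoising step.
    unconditional = [0.0, 1.0]
    conditional = [1.0, 3.0]  # e.g. conditioned on an embodiment label (assumed)
    print(cfg_combine(unconditional, conditional, guidance_scale=2.0))
```

In a cross-embodied setting the condition would plausibly be some embodiment descriptor, but how X-DiffVLA defines and applies that condition is not stated in the abstract.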

Related papers