Related papers: Embedding Morphology into Transformers for Cross-Robot Policy Learning

URL: http://arxiv.org/abs/2603.00182v1
Date: Thu, 26 Feb 2026 21:42:34 GMT
Title: Embedding Morphology into Transformers for Cross-Robot Policy Learning
Authors: Kei Suzuki, Jing Liu, Ye Wang, Chiori Hori, Matthew Brand, Diego Romeres, Toshiaki Koike-Akino
Abstract summary: Cross-robot policy learning remains a central challenge in robot learning. We propose an embodiment-aware transformer policy that injects morphology via three mechanisms. Across a range of embodiments, this structured integration consistently improves performance over a vanilla pi0.5 VLA baseline.
Score: 24.85486808758998
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cross-robot policy learning -- training a single policy to perform well across multiple embodiments -- remains a central challenge in robot learning. Transformer-based policies, such as vision-language-action (VLA) models, are typically embodiment-agnostic and must infer kinematic structure purely from observations, which can reduce robustness across embodiments and even limit performance within a single embodiment. We propose an embodiment-aware transformer policy that injects morphology via three mechanisms: (1) kinematic tokens that factorize actions across joints and compress time through per-joint temporal chunking; (2) a topology-aware attention bias that encodes kinematic topology as an inductive bias in self-attention, encouraging message passing along kinematic edges; and (3) joint-attribute conditioning that augments topology with per-joint descriptors to capture semantics beyond connectivity. Across a range of embodiments, this structured integration consistently improves performance over a vanilla pi0.5 VLA baseline, indicating improved robustness both within an embodiment and across embodiments.
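Illustrative sketch: mechanism (2) in the abstract, a topology-aware attention bias, can be pictured as an additive term on the attention logits that favors joints close together in the kinematic graph. The abstract does not give the exact form of the bias, so the hop-distance penalty below is an assumption, and the names `hop_distances`, `topology_bias`, `biased_attention_weights`, `penalty`, and `max_hops` are hypothetical rather than taken from the paper.

```python
import math
from collections import deque


def hop_distances(num_joints, edges):
    """All-pairs hop counts on an undirected kinematic graph, via BFS."""
    adj = [[] for _ in range(num_joints)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[math.inf] * num_joints for _ in range(num_joints)]
    for start in range(num_joints):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[start][v] == math.inf:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist


def topology_bias(dist, penalty=1.0, max_hops=4):
    """Assumed bias form: -penalty * min(hops, max_hops) per joint pair."""
    return [[-penalty * min(d, max_hops) for d in row] for row in dist]


def biased_attention_weights(scores, bias):
    """Row-wise softmax over (raw attention scores + additive bias)."""
    weights = []
    for score_row, bias_row in zip(scores, bias):
        logits = [s + b for s, b in zip(score_row, bias_row)]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical 4-joint serial arm: joint 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
dist = hop_distances(4, edges)
bias = topology_bias(dist)

# With uniform raw scores, the bias alone shapes the attention pattern.
scores = [[0.0] * 4 for _ in range(4)]
weights = biased_attention_weights(scores, bias)
```

With uniform raw scores, each joint token attends most to itself and its kinematic neighbors, and progressively less to joints further along the chain. A learned per-hop bias, or a hard mask restricted to adjacent joints, would be equally plausible readings of the abstract.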

Related papers

Structural Action Transformer for 3D Dexterous Manipulation [80.07649565189035]
Cross-embodiment skill transfer is a challenge for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations. This paper proposes a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective.
arXiv Detail & Related papers (2026-03-04T11:38:12Z)
MOTIF: Learning Action Motifs for Few-shot Cross-Embodiment Transfer [55.982504915794514]
Cross-embodiment policies typically rely on shared-private architectures. We introduce MOTIF for efficient few-shot cross-embodiment transfer. We show that MOTIF significantly outperforms strong baselines in few-shot transfer scenarios.
arXiv Detail & Related papers (2026-02-14T13:21:40Z)
ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States [9.721009445297716]
ArtGen is a conditional diffusion-based framework capable of generating articulated 3D objects with accurate geometry and coherent kinematics. Specifically, ArtGen employs cross-state Monte Carlo sampling to explicitly enforce global kinematic consistency. A compositional 3D-VAE latent prior enhanced with local-global attention effectively captures fine-grained geometry and global part-level relationships.
arXiv Detail & Related papers (2025-12-13T17:00:03Z)
Rethinking the Role of Dynamic Sparse Training for Scalable Deep Reinforcement Learning [58.533203990515034]
Scaling neural networks has driven breakthrough advances in machine learning, yet this paradigm fails in deep reinforcement learning (DRL). We show that dynamic sparse training strategies provide module-specific benefits that complement the primary scalability foundation established by architectural improvements. We finally distill these insights into Module-Specific Training (MST), a practical framework that exploits the benefits of architectural improvements and demonstrates substantial scalability gains across diverse RL algorithms without algorithmic modifications.
arXiv Detail & Related papers (2025-10-14T03:03:08Z)
AnyBody: A Benchmark Suite for Cross-Embodiment Manipulation [59.671764778486995]
Generalizing control policies to novel embodiments remains a fundamental challenge in enabling scalable and transferable learning in robotics. We introduce a benchmark for learning cross-embodiment manipulation, focusing on two foundational tasks, reach and push, across a diverse range of morphologies. We evaluate the ability of different RL policies to learn from multiple morphologies and to generalize to novel ones.
arXiv Detail & Related papers (2025-05-21T00:21:38Z)
ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [73.75271615101754]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.