Universal Actions for Enhanced Embodied Foundation Models
- URL: http://arxiv.org/abs/2501.10105v2
- Date: Sat, 08 Mar 2025 13:55:48 GMT
- Title: Universal Actions for Enhanced Embodied Foundation Models
- Authors: Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Xianyuan Zhan
- Abstract summary: We introduce UniAct, a new embodied foundation modeling framework operating in a Universal Action Space. Our learned universal actions capture the generic atomic behaviors across diverse robots by exploiting their shared structural features. Our 0.5B instantiation of UniAct outperforms 14X larger SOTA embodied foundation models in extensive evaluations on various real-world and simulation robots.
- Score: 25.755178700280933
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Training on diverse, internet-scale data is a key factor in the success of recent large foundation models. Yet, using the same recipe for building embodied agents has faced noticeable difficulties. Despite the availability of many crowd-sourced embodied datasets, their action spaces often exhibit significant heterogeneity due to distinct physical embodiment and control interfaces for different robots, causing substantial challenges in developing embodied foundation models using cross-domain data. In this paper, we introduce UniAct, a new embodied foundation modeling framework operating in a Universal Action Space. Our learned universal actions capture the generic atomic behaviors across diverse robots by exploiting their shared structural features, and enable enhanced cross-domain data utilization and cross-embodiment generalizations by eliminating the notorious heterogeneity. The universal actions can be efficiently translated back to heterogeneous actionable commands by simply adding embodiment-specific details, from which fast adaptation to new robots becomes simple and straightforward. Our 0.5B instantiation of UniAct outperforms 14X larger SOTA embodied foundation models in extensive evaluations on various real-world and simulation robots, showcasing exceptional cross-embodiment control and adaptation capability, highlighting the crucial benefit of adopting universal actions. Project page: https://github.com/2toinf/UniAct
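The abstract implies a two-part design: a shared backbone that selects from a discrete, embodiment-agnostic action vocabulary, plus lightweight per-robot decoders that add embodiment-specific details. Below is a minimal sketch of that idea; every module name and size is an illustrative assumption, not taken from the paper or its released code.

```python
# Sketch of a universal-action policy in the spirit of the abstract:
# a shared backbone scores a discrete codebook of universal (atomic)
# behaviors, and a small per-embodiment decoder translates the chosen
# code into that robot's native command space. All names and sizes are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class UniversalActionPolicy(nn.Module):
    def __init__(self, obs_dim=512, num_codes=256, code_dim=64):
        super().__init__()
        self.code_dim = code_dim
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU())
        self.universal_head = nn.Linear(512, num_codes)    # logits over codebook
        self.codebook = nn.Embedding(num_codes, code_dim)  # shared action vocab
        self.decoders = nn.ModuleDict()                    # per-robot translators

    def add_embodiment(self, name: str, action_dim: int):
        # Adapting to a new robot only requires training this small head.
        self.decoders[name] = nn.Sequential(
            nn.Linear(self.code_dim + 512, 128), nn.ReLU(),
            nn.Linear(128, action_dim))

    def forward(self, obs, embodiment: str):
        feat = self.backbone(obs)
        code = self.codebook(self.universal_head(feat).argmax(dim=-1))
        # Embodiment-specific details are filled in by the decoder.
        return self.decoders[embodiment](torch.cat([code, feat], dim=-1))

policy = UniversalActionPolicy()
policy.add_embodiment("franka_7dof", action_dim=7)
print(policy(torch.randn(1, 512), "franka_7dof").shape)  # torch.Size([1, 7])
```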
Related papers
- ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments [134.95780765985515]
We introduce ACE-Brain-0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation.
Our key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments.
We propose the Scaffold-Specialize-Reconcile (SSR) paradigm, which first establishes a shared spatial foundation, then cultivates domain-specialized experts, and finally harmonizes them through data-free model merging.
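The final "reconcile" step merges the specialized experts without any data. The summary does not say which merging scheme is used; the snippet below sketches the simplest common variant, uniform weight averaging of experts fine-tuned from a shared base, purely for illustration.

```python
# Illustrative data-free model merging: average the state dicts of several
# domain experts fine-tuned from the same base model. This is an assumption
# about the mechanism; the paper's actual merge may differ.
import torch

def merge_state_dicts(expert_state_dicts, weights=None):
    n = len(expert_state_dicts)
    weights = weights if weights is not None else [1.0 / n] * n
    merged = {}
    for key in expert_state_dicts[0]:
        merged[key] = sum(w * sd[key].float()
                          for w, sd in zip(weights, expert_state_dicts))
    return merged

# Usage: generalist.load_state_dict(merge_state_dicts([sd_a, sd_b, sd_c]))
```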
arXiv Detail & Related papers (2026-03-03T17:53:45Z) - RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation [37.52152452548065]
RoboGene is an agentic framework designed to automate the generation of diverse, physically plausible manipulation tasks.
We conduct extensive quantitative analysis and large-scale real-world experiments, collecting datasets of 18k trajectories.
Results demonstrate that RoboGene significantly outperforms state-of-the-art foundation models.
arXiv Detail & Related papers (2026-02-18T13:29:43Z) - HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [83.41714103649751]
Development of embodied intelligence models depends on access to high-quality robot demonstration data.
We present HiMoE-VLA, a novel vision-language-action framework tailored to handle diverse, heterogeneous robotic data.
HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalization.
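The summary names a hierarchical mixture-of-experts as the mechanism for absorbing heterogeneous robot data. As a point of reference, here is a minimal single-level MoE layer with soft routing; the hierarchy and routing rules of HiMoE-VLA itself are not reproduced, and all sizes are arbitrary.

```python
# Minimal (non-hierarchical) mixture-of-experts layer with soft routing,
# shown only to illustrate the general MoE mechanism.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # router over experts
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)])

    def forward(self, x):                         # x: (batch, dim)
        probs = self.gate(x).softmax(dim=-1)      # (batch, num_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)
        # Weight every expert's output by its gate probability.
        return (outs * probs.unsqueeze(1)).sum(dim=-1)

y = MoELayer()(torch.randn(8, 256))  # (8, 256)
```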
arXiv Detail & Related papers (2025-12-05T13:21:05Z) - GBC: Generalized Behavior-Cloning Framework for Whole-Body Humanoid Imitation [5.426712963311386]
Generalized Behavior Cloning (GBC) is a comprehensive and unified framework designed to solve the end-to-end challenge of whole-body humanoid imitation.
First, an adaptive data pipeline leverages a differentiable IK network to automatically retarget any human MoCap data to any humanoid.
Second, our novel DAgger-MMPPO algorithm with its MMTransformer architecture learns robust, high-fidelity imitation policies.
arXiv Detail & Related papers (2025-08-13T17:28:39Z) - Is Diversity All You Need for Scalable Robotic Manipulation? [50.747150672933316]
We investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates), challenging the conventional intuition that "more diverse is better."
We show that task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios.
We propose a distribution debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data.
arXiv Detail & Related papers (2025-07-08T17:52:44Z) - From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots [34.348365055311326]
BumbleBee (BB) is an expert-generalist learning framework that combines motion clustering and sim-to-real adaptation.
BB achieves state-of-the-art general whole-body control, setting a new benchmark for agile, robust, and generalizable humanoid performance in the real world.
arXiv Detail & Related papers (2025-06-15T09:09:34Z) - AnyBody: A Benchmark Suite for Cross-Embodiment Manipulation [59.671764778486995]
Generalizing control policies to novel embodiments remains a fundamental challenge in enabling scalable and transferable learning in robotics.
We introduce a benchmark for learning cross-embodiment manipulation, focusing on two foundational tasks, reach and push, across a diverse range of morphologies.
We evaluate the ability of different RL policies to learn from multiple morphologies and to generalize to novel ones.
arXiv Detail & Related papers (2025-05-21T00:21:38Z) - UniVLA: Learning to Act Anywhere with Task-centric Latent Actions [32.83715417294052]
UniVLA is a new framework for learning cross-embodiment vision-language-action (VLA) policies.
We derive task-centric action representations from videos with a latent action model.
We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments.
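A latent action model of this kind is commonly trained as an inverse-dynamics encoder paired with a forward model, so the latent is forced to capture the action-relevant change between frames. A minimal sketch under that assumption follows; UniVLA's task-centric factorization and actual architecture are not reproduced.

```python
# Generic latent action model: encode (o_t, o_{t+1}) into a compact latent
# z_t, then reconstruct o_{t+1} from (o_t, z_t). An illustrative assumption
# about the recipe, not UniVLA's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    def __init__(self, obs_dim=128, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(obs_dim + latent_dim, 256),
                                     nn.ReLU(), nn.Linear(256, obs_dim))

    def forward(self, o_t, o_next):
        z = self.encoder(torch.cat([o_t, o_next], dim=-1))
        o_pred = self.decoder(torch.cat([o_t, z], dim=-1))
        # Reconstruction forces z to carry the action-relevant change.
        return z, F.mse_loss(o_pred, o_next)
```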
arXiv Detail & Related papers (2025-05-09T15:11:13Z) - Towards Embodiment Scaling Laws in Robot Locomotion [36.86431442666063]
We investigate embodiment scaling laws in robot locomotion by training policies across multiple embodiments.
We find that increasing the number of embodiments improves generalization to unseen ones.
Results represent a step toward general embodied robotics, with potential relevance to adaptive control.
arXiv Detail & Related papers (2025-05-09T03:25:43Z) - Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.
Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.
Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
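For reference, here is a single DDPM-style training step for denoising a continuous action conditioned on observation features, which is the general recipe the summary describes; the shapes, noise schedule, and conditioning mechanism are illustrative assumptions, not Dita's.

```python
# One diffusion training step for continuous actions: corrupt the clean
# action with scheduled noise, then train a conditioned network to predict
# that noise. Illustrative only; Dita's transformer and in-context
# conditioning are not reproduced.
import torch
import torch.nn as nn
import torch.nn.functional as F

T_STEPS = 100
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_cum = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(7 + 1 + 512, 256), nn.ReLU(),
                         nn.Linear(256, 7))  # predicts the injected noise

def diffusion_loss(actions, obs_feat):       # (B, 7), (B, 512)
    t = torch.randint(0, T_STEPS, (actions.shape[0],))
    a = alphas_cum[t].unsqueeze(-1)
    noise = torch.randn_like(actions)
    noisy = a.sqrt() * actions + (1 - a).sqrt() * noise
    inp = torch.cat([noisy, t.float().unsqueeze(-1) / T_STEPS, obs_feat],
                    dim=-1)
    return F.mse_loss(denoiser(inp), noise)

loss = diffusion_loss(torch.randn(4, 7), torch.randn(4, 512))
```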
arXiv Detail & Related papers (2025-03-25T15:19:56Z) - Capability-Aware Shared Hypernetworks for Flexible Heterogeneous Multi-Robot Coordination [2.681242476043447]
We propose Capability-Aware Shared Hypernetworks (CASH) to enable a single architecture to dynamically adapt to each robot and the current context.
CASH encodes shared decision making strategies that can be adapted to each robot based on local observations and the robots' individual and collective capabilities.
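The core mechanism is a hypernetwork: a network whose output is the weights of another network, here conditioned on a robot's capability vector so one shared architecture yields per-robot behavior. A minimal sketch is below; the capability encoding and all sizes are assumptions, not CASH's published design.

```python
# Capability-conditioned hypernetwork: map a robot's capability vector to
# the (weight, bias) of a small linear policy head, so the same shared
# module adapts per robot.
import torch
import torch.nn as nn

class CapabilityHypernet(nn.Module):
    def __init__(self, cap_dim=4, obs_dim=32, act_dim=2):
        super().__init__()
        self.obs_dim, self.act_dim = obs_dim, act_dim
        self.hyper = nn.Linear(cap_dim, obs_dim * act_dim + act_dim)

    def forward(self, capabilities, obs):  # (B, cap_dim), (B, obs_dim)
        params = self.hyper(capabilities)
        W = params[:, :self.obs_dim * self.act_dim].view(
            -1, self.act_dim, self.obs_dim)
        b = params[:, self.obs_dim * self.act_dim:]
        # Same architecture, robot-specific weights.
        return torch.bmm(W, obs.unsqueeze(-1)).squeeze(-1) + b

act = CapabilityHypernet()(torch.randn(3, 4), torch.randn(3, 32))  # (3, 2)
```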
arXiv Detail & Related papers (2025-01-10T15:39:39Z) - GRAPE: Generalizing Robot Policy via Preference Alignment [60.36381142741252]
We present GRAPE: Generalizing Robot Policy via Preference Alignment.
We show GRAPE increases success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%, respectively.
GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 44.31% and rollout step-length by 11.15%, respectively.
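Preference alignment of this kind is typically implemented as a DPO-style objective over preferred versus rejected trajectories. The loss below is that generic form, shown only as a reference point; GRAPE's actual objective may differ.

```python
# DPO-style trajectory preference loss: raise the log-probability margin of
# the preferred trajectory over the rejected one, measured against a frozen
# reference policy. A generic stand-in, not GRAPE's exact objective.
import torch.nn.functional as F

def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```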
arXiv Detail & Related papers (2024-11-28T18:30:10Z) - Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation [49.03165169369552]
By training a single policy across many different kinds of robots, a robot learning method can leverage much broader and more diverse datasets.
We propose CrossFormer, a scalable and flexible transformer-based policy that can consume data from any embodiment.
We demonstrate that the same network weights can control vastly different robots, including single and dual arm manipulation systems, wheeled robots, quadcopters, and quadrupeds.
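One common way to realize this is per-embodiment tokenizers and action heads wrapped around a single shared transformer trunk. The sketch below illustrates that pattern; it is an assumption about the mechanism, not CrossFormer's published architecture.

```python
# Shared-trunk cross-embodiment policy: each robot gets its own observation
# tokenizer and action head, while the transformer weights are shared.
# Illustrative sketch; sizes and module layout are assumptions.
import torch
import torch.nn as nn

class CrossEmbodimentPolicy(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.dim = dim
        self.tokenizers = nn.ModuleDict()
        self.heads = nn.ModuleDict()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)  # shared

    def register(self, name, obs_dim, act_dim):
        self.tokenizers[name] = nn.Linear(obs_dim, self.dim)
        self.heads[name] = nn.Linear(self.dim, act_dim)

    def forward(self, obs, embodiment):        # obs: (B, T, obs_dim)
        out = self.trunk(self.tokenizers[embodiment](obs))
        return self.heads[embodiment](out[:, -1])  # readout from last token

policy = CrossEmbodimentPolicy()
policy.register("quadruped", obs_dim=48, act_dim=12)
policy.register("arm", obs_dim=20, act_dim=7)
a = policy(torch.randn(2, 5, 48), "quadruped")  # (2, 12)
```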
arXiv Detail & Related papers (2024-08-21T17:57:51Z) - A Novel Cross-Perturbation for Single Domain Generalization [54.612933105967606]
Single domain generalization aims to enhance the ability of the model to generalize to unknown domains when trained on a single source domain.
The limited diversity in the training data hampers the learning of domain-invariant features, resulting in compromised generalization performance.
We propose CPerb, a simple yet effective cross-perturbation method to enhance the diversity of the training data.
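The summary does not detail the perturbations themselves, so the snippet below pairs a generic image-level jitter with a generic feature-level noise injection purely to illustrate the two-level perturbation idea; neither is CPerb's actual design.

```python
# Stand-in perturbations at two levels (image and feature); applying both
# views of each sample is the "cross" idea. Assumed forms, not CPerb's.
import torch

def image_perturb(x):
    # Global photometric jitter; assumes x in [0, 1] with shape (B, C, H, W).
    scale = torch.empty(x.size(0), 1, 1, 1).uniform_(0.8, 1.2)
    return (x * scale).clamp(0.0, 1.0)

def feature_perturb(f, std=0.1):
    # Additive Gaussian noise in feature space.
    return f + std * torch.randn_like(f)
```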
arXiv Detail & Related papers (2023-08-02T03:16:12Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z) - PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training [25.50131893785007]
This work introduces a paradigm for pre-training a general purpose representation that can serve as a starting point for multiple tasks on a given robot.
We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion.
We show that finetuning small task-specific networks on top of the larger pretrained model results in significantly better performance compared to training a single model from scratch for all tasks simultaneously.
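The reported recipe, small task heads fine-tuned on top of one large pretrained trunk, looks roughly like this in code; whether the trunk stays frozen is an assumption here, and the modules are stand-ins.

```python
# Finetune small task-specific heads on a pretrained trunk.
import torch.nn as nn

trunk = nn.Sequential(nn.Linear(64, 256), nn.ReLU())  # pretrained stand-in
for p in trunk.parameters():
    p.requires_grad = False  # reuse the pretrained representation as-is

heads = nn.ModuleDict({
    "localization": nn.Linear(256, 3),
    "mapping": nn.Linear(256, 10),
})
# Only the head parameters are passed to the optimizer during finetuning.
```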
arXiv Detail & Related papers (2022-09-22T16:20:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.