Humanoid World Models: Open World Foundation Models for Humanoid Robotics
- URL: http://arxiv.org/abs/2506.01182v2
- Date: Tue, 08 Jul 2025 20:18:16 GMT
- Title: Humanoid World Models: Open World Foundation Models for Humanoid Robotics
- Authors: Muhammad Qasim Ali, Aditya Sridhar, Shahbuland Matiana, Alex Wong, Mohammad Al-Sharman,
- Abstract summary: We introduce Humanoid World Models (HWM), a family of lightweight, open-source models that forecast future egocentric video conditioned on humanoid control tokens. We train two types of generative models, Masked Transformers and Flow-Matching, on 100 hours of humanoid demonstrations. Our parameter-sharing techniques reduce model size by 33-53% with minimal impact on performance or visual fidelity.
- Score: 2.7781524868854715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humanoid robots, with their human-like form, are uniquely suited for interacting in environments built for people. However, enabling humanoids to reason, plan, and act in complex open-world settings remains a challenge. World models, which predict the future outcome of a given action, can support these capabilities by serving as a dynamics model for long-horizon planning and by generating synthetic data for policy learning. We introduce Humanoid World Models (HWM), a family of lightweight, open-source models that forecast future egocentric video conditioned on humanoid control tokens. We train two types of generative models, Masked Transformers and Flow-Matching, on 100 hours of humanoid demonstrations. Additionally, we explore architectural variants with different attention mechanisms and parameter-sharing strategies. Our parameter-sharing techniques reduce model size by 33-53% with minimal impact on performance or visual fidelity. HWMs are designed to be trained and deployed in practical academic and small-lab settings, using only 1-2 GPUs.
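To make the abstract's two architectural claims concrete, here is a minimal, hypothetical sketch of a flow-matching objective over future video latents conditioned on control tokens, with cross-layer parameter sharing in the transformer trunk. This is not the authors' released code; all names, shapes, and the linear interpolation path are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBlockTransformer(nn.Module):
    """Trunk that applies each stored block `reuse` times (weight tying),
    giving an effective depth of unique_blocks * reuse while paying for
    only unique_blocks blocks of parameters."""
    def __init__(self, dim=512, heads=8, unique_blocks=6, reuse=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            for _ in range(unique_blocks)
        )
        self.reuse = reuse

    def forward(self, x):
        for block in self.blocks:
            for _ in range(self.reuse):  # repeated application, shared weights
                x = block(x)
        return x

class ToyControlConditionedFlowModel(nn.Module):
    """Predicts the flow-matching velocity field over future video latents,
    conditioned on a humanoid control embedding and the flow time t."""
    def __init__(self, latent_dim=512, control_dim=32):
        super().__init__()
        self.control_proj = nn.Linear(control_dim, latent_dim)
        self.time_proj = nn.Linear(1, latent_dim)
        self.trunk = SharedBlockTransformer(dim=latent_dim)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, x_t, t, control):
        # Prepend time and control as two extra conditioning tokens.
        cond = torch.stack(
            [self.time_proj(t[:, None]), self.control_proj(control)], dim=1)
        h = self.trunk(torch.cat([cond, x_t], dim=1))
        return self.head(h[:, cond.shape[1]:])  # velocity for video tokens only

def flow_matching_loss(model, x1, control):
    """x1: clean future-frame latents, shape (batch, tokens, latent_dim)."""
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform flow time
    xt = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1  # linear path
    v_target = x1 - x0                             # constant target velocity
    return F.mse_loss(model(xt, t, control), v_target)
```

With unique_blocks=6 and reuse=2, the trunk runs 12 block applications while storing only 6 blocks of parameters, roughly a 50% trunk-size reduction, consistent in spirit with the 33-53% range the abstract reports.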
Related papers
- HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers [60.86393841247567]
HumanRAM is a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets.
arXiv Detail & Related papers (2025-06-03T17:50:05Z)
- DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories [120.25799361925387]
DreamGen is a pipeline for training robot policies that generalize across behaviors and environments through neural trajectories. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection.
arXiv Detail & Related papers (2025-05-19T04:55:39Z)
- A Personalized Data-Driven Generative Model of Human Motion [0.0]
We propose a fully data-driven approach to generating original motion that captures the unique characteristics of specific individuals. Our model effectively replicates the velocity distribution and amplitude envelopes of the individual it was trained on while remaining distinct from other individuals.
arXiv Detail & Related papers (2025-03-19T14:03:20Z)
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [133.23509142762356]
General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy. We introduce GR00T N1, an open foundation model for humanoid robots.
arXiv Detail & Related papers (2025-03-18T21:06:21Z)
- VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms the state-of-the-art baseline GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z)
- Making Large Language Models into World Models with Precondition and Effect Knowledge [1.8561812622368763]
We show that Large Language Models (LLMs) can be induced to perform two critical world model functions.
We validate that the precondition and effect knowledge generated by our models aligns with human understanding of world dynamics.
arXiv Detail & Related papers (2024-09-18T19:28:04Z)
- Fine-Tuning Generative Models as an Inference Method for Robotic Tasks [18.745665662647912]
We investigate how to quickly adapt the sample generation of neural network models to observations in robotic tasks.
The key idea is to quickly fine-tune the model by fitting it to generated samples matching the observed evidence.
We show that our method can be applied to both autoregressive models and variational autoencoders.
arXiv Detail & Related papers (2023-10-19T16:11:49Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Real-World Humanoid Locomotion with Reinforcement Learning [92.85934954371099]
We present a fully learning-based approach for real-world humanoid locomotion.
Our controller can walk over various outdoor terrains, is robust to external disturbances, and can adapt in context.
arXiv Detail & Related papers (2023-03-06T18:59:09Z)
- RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of data size, model size, and data diversity, based on large-scale data collected from real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
- Factored World Models for Zero-Shot Generalization in Robotic Manipulation [7.258229016768018]
We learn to generalize over robotic pick-and-place tasks using object-factored world models.
We use a residual stack of graph neural networks that receive action information at multiple levels in both their node and edge neural networks.
We show that an ensemble of our models can be used to plan for tasks involving up to 12 pick-and-place actions using search.
arXiv Detail & Related papers (2022-02-10T21:26:11Z)
- Real-time Pose Estimation from Images for Multiple Humanoid Robots [45.182157261640675]
We present a lightweight pose estimation model that can work in real-time on humanoid robots in the RoboCup Humanoid League environment.
The results of this work have the potential to enable many advanced behaviors for soccer-playing robots.
arXiv Detail & Related papers (2021-07-06T15:33:57Z)
- S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling [103.65625425020129]
We represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data.
We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2021-01-17T02:16:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.