Humanoid Locomotion as Next Token Prediction
- URL: http://arxiv.org/abs/2402.19469v1
- Date: Thu, 29 Feb 2024 18:57:37 GMT
- Title: Humanoid Locomotion as Next Token Prediction
- Authors: Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran,
Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik
- Abstract summary: Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories.
We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot.
Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training, such as walking backward.
- Score: 84.21335675130021
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We cast real-world humanoid control as a next token prediction problem, akin
to predicting the next word in language. Our model is a causal transformer
trained via autoregressive prediction of sensorimotor trajectories. To account
for the multi-modal nature of the data, we perform prediction in a
modality-aligned way, and for each input token predict the next token from the
same modality. This general formulation enables us to leverage data with
missing modalities, like video trajectories without actions. We train our model
on a collection of simulated trajectories coming from prior neural network
policies, model-based controllers, motion capture data, and YouTube videos of
humans. We show that our model enables a full-sized humanoid to walk in San
Francisco zero-shot. Our model can transfer to the real world even when trained
on only 27 hours of walking data, and can generalize to commands not seen
during training like walking backward. These findings suggest a promising path
toward learning challenging real-world control tasks by generative modeling of
sensorimotor trajectories.
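The recipe in the abstract is concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration of modality-aligned autoregressive prediction over interleaved observation/action tokens, including training on trajectories with missing actions (e.g., those derived from video). All names here (`SensorimotorGPT`, `obs_dim`, `act_dim`, the MSE objective, the learned mask embedding for absent actions) are assumptions made for illustration, not the authors' implementation.

```python
# Hedged sketch: modality-aligned next-token prediction for sensorimotor
# trajectories, assuming continuous observation/action vectors and an MSE loss.
import torch
import torch.nn as nn

class SensorimotorGPT(nn.Module):
    """Causal transformer over interleaved observation/action tokens."""

    def __init__(self, obs_dim=32, act_dim=12, d_model=256,
                 n_layers=4, n_heads=4, max_steps=256):
        super().__init__()
        self.obs_in = nn.Linear(obs_dim, d_model)   # observation tokenizer
        self.act_in = nn.Linear(act_dim, d_model)   # action tokenizer
        # Learned embedding that stands in for missing action tokens
        # (assumption: one way to realize "data with missing modalities").
        self.mask_tok = nn.Parameter(torch.zeros(d_model))
        self.pos = nn.Embedding(2 * max_steps, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.obs_head = nn.Linear(d_model, obs_dim)  # obs token -> next obs
        self.act_head = nn.Linear(d_model, act_dim)  # act token -> next act

    def forward(self, obs, act, act_valid):
        # obs: (B, T, obs_dim), act: (B, T, act_dim), act_valid: (B, T) bool
        B, T, _ = obs.shape
        a = self.act_in(act)
        a = torch.where(act_valid.unsqueeze(-1), a, self.mask_tok.expand_as(a))
        # Interleave as (o_1, a_1, o_2, a_2, ...): sequence length 2T.
        x = torch.stack([self.obs_in(obs), a], dim=2).reshape(B, 2 * T, -1)
        x = x + self.pos(torch.arange(2 * T, device=obs.device))
        causal = torch.triu(torch.full((2 * T, 2 * T), float("-inf"),
                                       device=obs.device), diagonal=1)
        h = self.backbone(x, mask=causal)
        # Even positions hold observation tokens, odd positions action tokens.
        return self.obs_head(h[:, 0::2]), self.act_head(h[:, 1::2])

def modality_aligned_loss(model, obs, act, act_valid):
    obs_pred, act_pred = model(obs, act, act_valid)
    # Modality-aligned targets: o_t predicts o_{t+1}, a_t predicts a_{t+1}.
    obs_loss = ((obs_pred[:, :-1] - obs[:, 1:]) ** 2).mean()
    act_err = ((act_pred[:, :-1] - act[:, 1:]) ** 2).mean(-1)   # (B, T-1)
    w = act_valid[:, 1:].float()  # drop terms whose action target is missing
    return obs_loss + (act_err * w).sum() / w.sum().clamp(min=1.0)
```

At deployment, such a model would be rolled out autoregressively: the action head's output at the current step is executed on the robot, the resulting observation is appended to the context, and the process repeats.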
Related papers
- Multi-Transmotion: Pre-trained Model for Human Motion Prediction [68.87010221355223]
Multi-Transmotion is an innovative transformer-based model designed for cross-modality pre-training.
Our methodology demonstrates competitive performance across various datasets on several downstream tasks.
arXiv Detail & Related papers (2024-11-04T23:15:21Z)
- VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions [10.748597086208145]
In this work, we propose a novel method that also incorporates visual input from surround-view cameras.
Our method achieves a latency of 53 ms, making it feasible for real-time processing.
Our experiments show that both the visual inputs and the textual descriptions contribute to improvements in trajectory prediction performance.
arXiv Detail & Related papers (2024-07-17T06:39:52Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
- Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
arXiv Detail & Related papers (2023-06-16T17:58:10Z)
- InCrowdFormer: On-Ground Pedestrian World Model From Egocentric Views [28.54213112712818]
We introduce an on-ground Pedestrian World Model that can predict how pedestrians move around an observer in the crowd on the ground plane.
InCrowdFormer fully leverages the Transformer architecture by modeling pedestrian interaction and the egocentric-to-top-down view transformation with attention.
We encode the uncertainties arising from unknown pedestrian heights with latent codes to predict the posterior distributions of pedestrian positions.
arXiv Detail & Related papers (2023-03-16T17:51:02Z)
- Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion [88.45326906116165]
We present a new framework that formulates the trajectory prediction task as a reverse process of motion indeterminacy diffusion (MID).
We encode the historical behavior and social interactions as a state embedding and devise a Transformer-based diffusion model to capture the temporal dependencies of trajectories.
Experiments on the human trajectory prediction benchmarks including the Stanford Drone and ETH/UCY datasets demonstrate the superiority of our method.
arXiv Detail & Related papers (2022-03-25T16:59:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.