AirScape: An Aerial Generative World Model with Motion Controllability
- URL: http://arxiv.org/abs/2507.08885v2
- Date: Fri, 10 Oct 2025 07:40:25 GMT
- Title: AirScape: An Aerial Generative World Model with Motion Controllability
- Authors: Baining Zhao, Rongze Tang, Mingyuan Jia, Ziyou Wang, Fanghang Man, Xin Zhang, Yu Shang, Weichen Zhang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li
- Abstract summary: AirScape is the first world model designed for six-degree-of-freedom embodied aerial agents. It predicts future observations based on current visual inputs and motion intentions.
- Score: 29.696659138543136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How to enable agents to predict the outcomes of their own motion intentions in three-dimensional space has been a fundamental problem in embodied intelligence. To explore general spatial imagination capability, we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase schedule to train a foundation model--initially devoid of embodied spatial knowledge--into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints. Experimental results demonstrate that AirScape significantly outperforms existing foundation models in 3D spatial imagination capabilities, especially with over a 50% improvement in metrics reflecting motion alignment. The project is available at: https://embodiedcity.github.io/AirScape/.
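The abstract specifies the model's interface (a current frame plus a motion intention in, a future observation sequence out) without architectural detail. The toy sketch below illustrates that interface only; every module, dimension, and name is hypothetical, and AirScape itself adapts a large video foundation model rather than a small recurrent network like this.

```python
# Illustrative interface of a motion-conditioned aerial world model.
# All modules and sizes are hypothetical stand-ins, not AirScape's design.
import torch
import torch.nn as nn

class ToyAerialWorldModel(nn.Module):
    def __init__(self, motion_dim=6, hidden=256):
        super().__init__()
        # Encode the current first-person-view frame into a latent state.
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, hidden, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Embed a 6-DoF motion intention (e.g. linear and angular rates).
        self.motion = nn.Linear(motion_dim, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        # Decode each latent state to a low-resolution future frame.
        self.dec = nn.Sequential(
            nn.Linear(hidden, 16 * 16 * 3), nn.Unflatten(1, (3, 16, 16)))

    def forward(self, frame, intentions):
        """frame: (B,3,H,W); intentions: (B,T,6) -> frames (B,T,3,16,16)."""
        h = self.enc(frame)
        outs = []
        for t in range(intentions.shape[1]):
            h = self.rnn(self.motion(intentions[:, t]), h)
            outs.append(self.dec(h))
        return torch.stack(outs, dim=1)

pred = ToyAerialWorldModel()(torch.randn(2, 3, 64, 64), torch.randn(2, 8, 6))
print(pred.shape)  # torch.Size([2, 8, 3, 16, 16])
```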
Related papers
- Walk through Paintings: Egocentric World Models from Internet Priors [65.30611174953958]
We present the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model.
Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers.
Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids.
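The summary does not say what the conditioning layers look like. One common lightweight choice is a FiLM-style adapter that scales and shifts frozen backbone features from the motor command; the sketch below shows that generic technique, offered as a plausible reading rather than the paper's confirmed mechanism.

```python
# FiLM-style action adapter: a generic lightweight conditioning layer,
# not necessarily EgoWM's exact mechanism.
import torch
import torch.nn as nn

class ActionFiLM(nn.Module):
    """Modulate backbone features with per-channel scale/shift from an action."""
    def __init__(self, action_dim, channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(action_dim, 2 * channels)
        nn.init.zeros_(self.to_scale_shift.weight)  # start as an identity map
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, feats, action):
        # feats: (B, C, H, W); action: (B, action_dim)
        scale, shift = self.to_scale_shift(action).chunk(2, dim=-1)
        return feats * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

film = ActionFiLM(action_dim=3, channels=64)  # e.g. a 3-DoF mobile robot
out = film(torch.randn(2, 64, 8, 8), torch.randn(2, 3))
print(out.shape)  # torch.Size([2, 64, 8, 8])
```

Zero-initializing the adapter lets the pretrained model's behavior pass through unchanged at the start of fine-tuning, which is the usual rationale for such layers.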
arXiv Detail & Related papers (2026-01-21T18:59:32Z) - AirSim360: A Panoramic Simulation Platform within Drone View [63.238263531772446]
AirSim360 is a simulation platform for omnidirectional data from aerial viewpoints.
AirSim360 focuses on three key aspects, including a render-aligned data and labeling paradigm for pixel-level geometric, semantic, and entity-level understanding.
Unlike existing simulators, our work is the first to systematically model the 4D real world under an omnidirectional setting.
arXiv Detail & Related papers (2025-12-01T18:59:30Z) - SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control [85.91101551600978]
We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements.
We build a foundation model for motion tracking by scaling along three axes: network size, dataset volume, and compute.
We show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces.
arXiv Detail & Related papers (2025-11-11T04:37:40Z) - LookOut: Real-World Humanoid Egocentric Navigation [61.14016011125957]
We introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video.
To solve this task, we propose a framework that reasons over temporally aggregated 3D latent features.
Motivated by the lack of training data in this space, we present a dataset collected through this approach.
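As a concrete reading of the task, a head that consumes temporally aggregated features and regresses a fixed horizon of poses (translation plus an axis-angle rotation per step) could look like the hypothetical sketch below; the paper's actual framework operates on 3D latent features.

```python
# Hypothetical future-pose head: aggregate per-frame features over time,
# then regress a fixed horizon of 6D poses (tx,ty,tz, rx,ry,rz).
import torch
import torch.nn as nn

class FuturePoseHead(nn.Module):
    def __init__(self, feat_dim=512, horizon=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.agg = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(feat_dim, horizon * 6)
        self.horizon = horizon

    def forward(self, video_feats):            # (B, T, feat_dim)
        h = self.agg(video_feats).mean(dim=1)  # pool over observed frames
        return self.out(h).view(-1, self.horizon, 6)

poses = FuturePoseHead()(torch.randn(2, 16, 512))
print(poses.shape)  # torch.Size([2, 8, 6])
```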
arXiv Detail & Related papers (2025-08-20T06:43:36Z) - Learning Sequential Kinematic Models from Demonstrations for Multi-Jointed Articulated Objects [6.125464415922235]
We introduce OKSMs, a representation capturing both kinematic constraints and manipulation order for multi-DoF objects.
The accompanying Pokenet model improves joint axis and state estimation by over 20 percent on real-world data compared to prior methods.
arXiv Detail & Related papers (2025-05-09T18:09:06Z) - Learning 3D Persistent Embodied World Models [84.40585374179037]
We introduce a new persistent embodied world model with an explicit memory of previously generated content.
At generation time, our video diffusion model predicts RGB-D video of the future observations of the agent.
This generation is then aggregated into a persistent 3D map of the environment.
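The aggregation step is standard geometry: unproject each predicted RGB-D frame with camera intrinsics and pose, then append the points to a growing map. A minimal sketch with placeholder camera parameters:

```python
# Unproject a predicted RGB-D frame into world-space points and append it
# to a persistent point-cloud map. Intrinsics and pose are placeholders.
import torch

def unproject(depth, rgb, K, cam_to_world):
    """depth: (H,W); rgb: (H,W,3); K: (3,3); cam_to_world: (4,4) -> (N,6)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.flatten()
    x = (u.flatten() - K[0, 2]) * z / K[0, 0]   # pinhole back-projection
    y = (v.flatten() - K[1, 2]) * z / K[1, 1]
    pts = torch.stack([x, y, z, torch.ones_like(z)], dim=0)  # (4, N)
    world = (cam_to_world @ pts)[:3].T                       # (N, 3)
    return torch.cat([world, rgb.reshape(-1, 3)], dim=1)     # xyz + rgb

K = torch.tensor([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
persistent_map = []  # grows as the agent imagines new observations
persistent_map.append(unproject(torch.rand(64, 64), torch.rand(64, 64, 3),
                                K, torch.eye(4)))
print(persistent_map[0].shape)  # torch.Size([4096, 6])
```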
arXiv Detail & Related papers (2025-05-05T17:59:17Z) - TAFormer: A Unified Target-Aware Transformer for Video and Motion Joint Prediction in Aerial Scenes [14.924741503611749]
We introduce a novel task called Target-Aware Aerial Video Prediction, aiming to simultaneously predict future scenes and motion states of the target.
We introduce Spatiotemporal Attention (STA), which decouples the learning of video dynamics into spatial static attention and temporal dynamic attention, effectively modeling the scene appearance and motion.
To alleviate the difficulty of distinguishing targets in blurry predictions, we introduce Target-Sensitive Gaussian Loss (TSGL), enhancing the model's sensitivity to both the target's position and content.
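The decoupling that STA describes matches the standard factorized spatiotemporal attention pattern: attend across spatial tokens within each frame, then across time at each spatial location. A generic sketch (TAFormer's target-aware components and exact dimensions are omitted):

```python
# Generic factorized spatiotemporal attention: space within each frame,
# then time at each location. Dimensions are placeholders.
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, T, N, C) token grid
        B, T, N, C = x.shape
        s = x.reshape(B * T, N, C)               # spatial attention per frame
        s, _ = self.spatial(s, s, s)
        t = s.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        t, _ = self.temporal(t, t, t)            # temporal attention per location
        return t.reshape(B, N, T, C).permute(0, 2, 1, 3)

out = FactorizedSTAttention()(torch.randn(2, 6, 49, 256))
print(out.shape)  # torch.Size([2, 6, 49, 256])
```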
arXiv Detail & Related papers (2024-03-27T04:03:55Z) - Humanoid Locomotion as Next Token Prediction [84.21335675130021]
Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories.
We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot.
Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training, such as walking backward.
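The training recipe is the familiar language-modeling objective applied to discretized sensorimotor streams: a causal transformer predicts token t+1 from tokens up to t. A minimal sketch with placeholder vocabulary and sizes:

```python
# Next-token prediction over discretized sensorimotor trajectories.
# Vocabulary, depth, and widths are illustrative placeholders.
import torch
import torch.nn as nn

class NextTokenPolicy(nn.Module):
    def __init__(self, vocab=1024, dim=256, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                  # tokens: (B, T) discrete ids
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        return self.head(self.trunk(self.embed(tokens), mask=mask))

model = NextTokenPolicy()
tokens = torch.randint(0, 1024, (2, 32))
logits = model(tokens)
loss = nn.functional.cross_entropy(             # shifted target: token t+1
    logits[:, :-1].reshape(-1, 1024), tokens[:, 1:].reshape(-1))
print(loss.item())
```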
arXiv Detail & Related papers (2024-02-29T18:57:37Z) - Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can imitate the full range of human motion from a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
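Distilling skills from a frozen imitator is typically a regression problem: a latent-conditioned student matches the teacher's actions. The sketch below is that generic pattern under invented names and sizes, not the paper's actual training code.

```python
# Hypothetical skill-distillation step: a latent-conditioned student
# regresses actions of a frozen imitator (teacher). Names are illustrative.
import torch
import torch.nn as nn

state_dim, act_dim, latent_dim = 64, 23, 32
teacher = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                        nn.Linear(256, act_dim)).eval()  # frozen imitator
student = nn.Sequential(nn.Linear(state_dim + latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, act_dim))
encoder = nn.Linear(state_dim, latent_dim)               # state -> skill latent
opt = torch.optim.Adam(list(student.parameters()) + list(encoder.parameters()))

state = torch.randn(128, state_dim)
with torch.no_grad():
    target_action = teacher(state)                       # teacher supervision
z = encoder(state)
loss = nn.functional.mse_loss(student(torch.cat([state, z], -1)), target_action)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```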
arXiv Detail & Related papers (2023-10-06T20:48:43Z) - Autonomous Marker-less Rapid Aerial Grasping [5.892028494793913]
We propose a vision-based system for autonomous rapid aerial grasping.
We generate a dense point cloud of the detected objects and perform geometry-based grasp planning.
We show the first use of geometry-based grasping techniques with a flying platform.
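A toy example of what geometry-based grasp planning on a point cloud can mean: score antipodal point pairs whose surface normals oppose each other within the gripper width. This is a textbook heuristic, not the paper's planner.

```python
# Toy antipodal grasp scoring over a point cloud with surface normals.
# The gripper width and point cloud are illustrative.
import torch

def antipodal_scores(points, normals, max_width=0.08):
    """points, normals: (N,3) -> (N,N) pair scores; higher = better grasp."""
    d = torch.cdist(points, points)          # pairwise finger separations
    opposed = -(normals @ normals.T)         # 1.0 when normals are anti-parallel
    feasible = (d > 1e-4) & (d < max_width)  # within gripper width, not self
    return torch.where(feasible, opposed, torch.full_like(opposed, -1.0))

pts = torch.randn(100, 3) * 0.03
nrm = torch.nn.functional.normalize(torch.randn(100, 3), dim=1)
scores = antipodal_scores(pts, nrm)
i, j = divmod(scores.argmax().item(), scores.shape[1])
print("best pair:", i, j, "score:", round(scores[i, j].item(), 3))
```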
arXiv Detail & Related papers (2022-11-23T16:25:49Z) - Aerial Monocular 3D Object Detection [67.20369963664314]
DVDET is proposed to achieve aerial monocular 3D object detection in both the 2D image space and the 3D physical space.
To address the severe view deformation issue, we propose a novel trainable geo-deformable transformation module.
To encourage more researchers to investigate this area, we will release the dataset and related code.
arXiv Detail & Related papers (2022-08-08T08:32:56Z) - NavDreams: Towards Camera-Only RL Navigation Among Humans [35.57943738219839]
We investigate whether the world model concept, which has shown promising results for modeling environments and learning policies in Atari games, can also be applied to the camera-based navigation problem.
We create simulated environments where a robot must navigate past static and moving humans without colliding in order to reach its goal.
We find that state-of-the-art methods successfully solve the navigation problem and can generate dream-like predictions of future image sequences.
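The "dreaming" part of the world-model recipe is planning in latent space: encode the current image, roll a learned dynamics model forward under candidate action sequences, and score them without touching the simulator. A toy sketch with stand-in modules:

```python
# Latent "dreaming": score imagined action sequences with a learned
# dynamics and reward model. All modules are toy stand-ins.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
dynamics = nn.GRUCell(2, 128)   # action: (forward velocity, turn rate)
reward = nn.Linear(128, 1)      # learned proxy for progress toward the goal

def dream_return(obs, actions):
    """obs: (B,3,32,32); actions: (B,T,2) -> summed imagined reward (B,)."""
    h = enc(obs)
    total = torch.zeros(obs.shape[0])
    for t in range(actions.shape[1]):
        h = dynamics(actions[:, t], h)   # advance the latent without rendering
        total += reward(h).squeeze(-1)
    return total

ret = dream_return(torch.rand(4, 3, 32, 32), torch.randn(4, 10, 2))
print(ret.shape)  # torch.Size([4])
```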
arXiv Detail & Related papers (2022-03-23T09:46:44Z) - Rapid Exploration for Open-World Navigation with Latent Goal Models [78.45339342966196]
We describe a robotic learning system for autonomous exploration and navigation in diverse, open-world environments.
At the core of our method is a learned latent variable model of distances and actions, along with a non-parametric topological memory of images.
We use an information bottleneck to regularize the learned policy, giving us (i) a compact visual representation of goals, (ii) improved generalization capabilities, and (iii) a mechanism for sampling feasible goals for exploration.
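The information bottleneck in such systems is usually the variational form: encode the goal image to a Gaussian latent and penalize its KL divergence to a unit prior, which keeps the representation compact and lets plausible goals be sampled from the prior. A minimal sketch with placeholder sizes:

```python
# Variational information bottleneck on a goal representation: reparameterized
# Gaussian latent plus a KL penalty to N(0, I). Sizes are placeholders.
import torch
import torch.nn as nn

class GoalEncoder(nn.Module):
    def __init__(self, in_dim=512, z_dim=32):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, goal_feat):
        mu, logvar = self.mu(goal_feat), self.logvar(goal_feat)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
        return z, kl

enc = GoalEncoder()
z, kl = enc(torch.randn(8, 512))
beta = 1e-3                  # bottleneck strength (hypothetical value)
loss = beta * kl             # added to the policy/distance losses in training
print(z.shape, kl.item())
```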
arXiv Detail & Related papers (2021-04-12T23:14:41Z) - Future Frame Prediction for Robot-assisted Surgery [57.18185972461453]
We propose a ternary prior guided variational autoencoder (TPG-VAE) model for future frame prediction in robotic surgical video sequences.
Besides the content distribution, our model learns a motion distribution, a novel component for handling the small movements of surgical tools.
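Two separate latent distributions in a frame-prediction VAE typically means two reparameterized Gaussian branches, one for appearance and one for motion, decoded jointly. The toy sketch below shows that structure only; TPG-VAE's ternary priors and surgical specifics are omitted.

```python
# Toy two-latent VAE for next-frame prediction: separate Gaussian latents
# for content (appearance) and motion, decoded jointly. Sizes are placeholders.
import torch
import torch.nn as nn

class TwoLatentVAE(nn.Module):
    def __init__(self, z=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(),
                                 nn.Linear(2 * 3 * 32 * 32, 256), nn.ReLU())
        self.content = nn.Linear(256, 2 * z)   # mu, logvar for appearance
        self.motion = nn.Linear(256, 2 * z)    # mu, logvar for small movements
        self.dec = nn.Linear(2 * z, 3 * 32 * 32)

    @staticmethod
    def sample(params):
        mu, logvar = params.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, prev_frame, cur_frame):   # each (B, 3, 32, 32)
        h = self.enc(torch.cat([prev_frame, cur_frame], dim=1))
        z = torch.cat([self.sample(self.content(h)),
                       self.sample(self.motion(h))], dim=-1)
        return self.dec(z).view(-1, 3, 32, 32)  # predicted next frame

pred = TwoLatentVAE()(torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32))
print(pred.shape)  # torch.Size([2, 3, 32, 32])
```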
arXiv Detail & Related papers (2021-03-18T15:12:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.