Neural World Models for Computer Vision
- URL: http://arxiv.org/abs/2306.09179v1
- Date: Thu, 15 Jun 2023 14:58:21 GMT
- Title: Neural World Models for Computer Vision
- Authors: Anthony Hu
- Abstract summary: We present a framework to train a world model and a policy, parameterised by deep neural networks.
We leverage important computer vision concepts such as geometry, semantics, and motion to scale world models to complex urban driving scenes.
Our model can jointly predict static scene, dynamic scene, and ego-behaviour in an urban driving environment.
- Score: 2.741266294612776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans navigate in their environment by learning a mental model of the world
through passive observation and active interaction. Their world model allows
them to anticipate what might happen next and act accordingly with respect to
an underlying objective. Such world models hold strong promises for planning in
complex environments like in autonomous driving. A human driver, or a
self-driving system, perceives their surroundings with their eyes or their
cameras. They infer an internal representation of the world which should: (i)
have spatial memory (e.g. occlusions), (ii) fill partially observable or noisy
inputs (e.g. when blinded by sunlight), and (iii) be able to reason about
unobservable events probabilistically (e.g. predict different possible
futures). They are embodied intelligent agents that can predict, plan, and act
in the physical world through their world model. In this thesis we present a
general framework to train a world model and a policy, parameterised by deep
neural networks, from camera observations and expert demonstrations. We
leverage important computer vision concepts such as geometry, semantics, and
motion to scale world models to complex urban driving scenes.
First, we propose a model that predicts important quantities in computer
vision: depth, semantic segmentation, and optical flow. We then use 3D geometry
as an inductive bias to operate in the bird's-eye view space. We present for
the first time a model that can predict probabilistic future trajectories of
dynamic agents in bird's-eye view from 360° surround monocular cameras
only. Finally, we demonstrate the benefits of learning a world model in
closed-loop driving. Our model can jointly predict static scene, dynamic scene,
and ego-behaviour in an urban driving environment.
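To make the framework concrete, the following is a minimal, hypothetical PyTorch sketch of a camera-conditioned world model with a probabilistic transition and a policy head trained to imitate expert actions. Every module name, shape, and the single-frame input are illustrative assumptions, not the thesis implementation, which additionally operates in bird's-eye view and decodes semantic predictions.

```python
import torch
import torch.nn as nn

# Minimal, hypothetical sketch of a camera-based world model with a policy head.
# Module names, shapes and losses are illustrative assumptions only.

class WorldModelPolicy(nn.Module):
    def __init__(self, latent_dim=128, action_dim=2):
        super().__init__()
        # Image encoder: camera frame -> latent state.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Probabilistic transition: a distribution over the next latent state
        # given the current state and action, so different futures can be sampled.
        self.transition = nn.Linear(latent_dim + action_dim, 2 * latent_dim)
        # Policy head: latent state -> driving action (e.g. steering, acceleration).
        self.policy = nn.Linear(latent_dim, action_dim)

    def forward(self, image, action):
        state = self.encoder(image)
        mean, log_std = self.transition(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        next_state = mean + log_std.exp() * torch.randn_like(mean)  # reparameterised sample
        return next_state, self.policy(state)

# Imitation signal from offline expert demonstrations (no online interaction).
model = WorldModelPolicy()
frames = torch.randn(4, 3, 96, 96)        # batch of camera observations
expert_actions = torch.randn(4, 2)        # expert steering / acceleration
_, predicted_actions = model(frames, expert_actions)
nn.functional.mse_loss(predicted_actions, expert_actions).backward()
```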
Related papers
- Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving [15.100104512786107]
Drive-OccWorld adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving.
We propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model.
Experiments on the nuScenes dataset demonstrate that our method can generate plausible and controllable 4D occupancy.
arXiv Detail & Related papers (2024-08-26T11:53:09Z)
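As a rough illustration of the action conditioning described in the Drive-OccWorld summary above, the sketch below injects velocity, steering, and command conditions into the features used for occupancy forecasting. The shapes, the feature-wise addition, and all module names are assumptions made for the example, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of action-conditioned occupancy forecasting: action
# conditions (e.g. speed, steering angle, command) are embedded and injected
# into voxel features before predicting the next occupancy state.
# All shapes and module names are illustrative assumptions.

class ActionConditionedForecaster(nn.Module):
    def __init__(self, channels=16, cond_dim=4):
        super().__init__()
        self.cond_embed = nn.Linear(cond_dim, channels)               # embed action conditions
        self.forecast = nn.Conv3d(channels, channels, 3, padding=1)   # predict next voxel features

    def forward(self, occupancy_feat, conditions):
        # occupancy_feat: (B, C, X, Y, Z) voxel features; conditions: (B, cond_dim)
        cond = self.cond_embed(conditions)[:, :, None, None, None]
        return self.forecast(occupancy_feat + cond)                   # feature-wise conditioning

forecaster = ActionConditionedForecaster()
feat = torch.randn(2, 16, 32, 32, 8)                                  # current occupancy features
cond = torch.tensor([[5.0, 0.1, 1.0, 0.0],                            # speed, steering, command flags
                     [3.0, -0.2, 0.0, 1.0]])
next_feat = forecaster(feat, cond)                                     # controllable forecast
```

Changing the condition vector (for example, a different steering angle) changes the forecast, which is what makes the predicted occupancy controllable in this toy setup.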
- Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond [101.15395503285804]
General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI).
In this survey, we embark on a comprehensive exploration of the latest advancements in world models.
We examine challenges and limitations of world models, and discuss their potential future directions.
arXiv Detail & Related papers (2024-05-06T14:37:07Z)
- EgoGen: An Egocentric Synthetic Data Generator [53.32942235801499]
EgoGen is a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks.
At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment.
We demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views.
arXiv Detail & Related papers (2024-01-16T18:55:22Z)
- OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [67.49461023261536]
We propose OccWorld, a new framework for learning a world model in the 3D occupancy space.
We simultaneously predict the movement of the ego car and the evolution of the surrounding scenes.
OccWorld produces competitive planning results without using instance and map supervision.
arXiv Detail & Related papers (2023-11-27T17:59:41Z)
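The OccWorld summary above describes jointly predicting ego movement and scene evolution from 3D occupancy. The following is a minimal, hypothetical sketch of that idea, using a shared occupancy encoder with two heads; the class count, shapes, and heads themselves are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of joint ego-motion and scene-evolution prediction
# from semantic 3D occupancy. All shapes and modules are assumptions.

class OccupancyWorldModel(nn.Module):
    def __init__(self, classes=16, hidden=32):
        super().__init__()
        self.encode = nn.Conv3d(classes, hidden, 3, padding=1)           # shared occupancy encoder
        self.scene_head = nn.Conv3d(hidden, classes, 1)                   # next-step semantic occupancy
        self.ego_head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                      nn.Linear(hidden, 3))               # ego displacement (dx, dy, dyaw)

    def forward(self, occupancy):
        feat = torch.relu(self.encode(occupancy))
        return self.scene_head(feat), self.ego_head(feat)

model = OccupancyWorldModel()
occ = torch.randn(1, 16, 50, 50, 4)              # current semantic occupancy grid
next_occ_logits, ego_motion = model(occ)          # joint scene and ego prediction
```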
- Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward, fully self-supervised framework for policy pretraining in visuomotor driving.
We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only.
arXiv Detail & Related papers (2023-01-03T08:52:49Z)
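The PPGeo summary above hinges on a photometric error: depth and relative pose are predicted from two consecutive frames and used to warp one frame into the other view. Below is a simplified, hypothetical sketch of that objective; the toy networks, fixed intrinsics, and translation-only warp are assumptions chosen to keep the example short, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified photometric self-supervision: predict depth and a (toy, translation-only)
# relative pose from two consecutive frames, warp the source frame into the target
# view, and minimise the reconstruction error. Intrinsics and shapes are assumptions.

def warp_source(source, depth, translation, fx=100.0, fy=100.0):
    """Warp `source` into the target view under a translation-only pose."""
    b, _, h, w = source.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.float().expand(b, h, w)
    ys = ys.float().expand(b, h, w)
    # Shift pixels by the projected translation, scaled by inverse depth.
    xs = xs + fx * translation[:, 0:1, None] / depth.squeeze(1)
    ys = ys + fy * translation[:, 1:2, None] / depth.squeeze(1)
    grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
    return F.grid_sample(source, grid, align_corners=True)

depth_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 1, 3, padding=1), nn.Softplus())   # depth of target frame
pose_net = nn.Sequential(nn.Flatten(), nn.Linear(6 * 64 * 64, 3))          # toy relative translation

target = torch.rand(1, 3, 64, 64)
source = torch.rand(1, 3, 64, 64)
depth = depth_net(target) + 1e-3
translation = pose_net(torch.cat([target, source], dim=1))
reconstruction = warp_source(source, depth, translation)
photometric_loss = (reconstruction - target).abs().mean()   # supervises both depth and pose
photometric_loss.backward()
```

In the second stage described above, the same photometric error can in principle supervise a visual encoder that predicts ego-motion from the current frame alone.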
- Model-Based Imitation Learning for Urban Driving [26.782783239210087]
We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving.
Our model is trained on an offline corpus of urban driving data, without any online interaction with the environment.
Our approach is the first camera-only method that models static scene, dynamic scene, and ego-behaviour in an urban driving environment.
arXiv Detail & Related papers (2022-10-14T11:59:46Z)
- NavDreams: Towards Camera-Only RL Navigation Among Humans [35.57943738219839]
We investigate whether the world model concept, which has shown results for modeling and learning policies in Atari games, can also be applied to the camera-based navigation problem.
We create simulated environments where a robot must navigate past static and moving humans without colliding in order to reach its goal.
We find that state-of-the-art methods can successfully solve the navigation problem and can generate dream-like predictions of future image sequences.
arXiv Detail & Related papers (2022-03-23T09:46:44Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- Visual Navigation Among Humans with Optimal Control as a Supervisor [72.5188978268463]
We propose an approach that combines learning-based perception with model-based optimal control to navigate among humans.
Our approach is enabled by our novel data-generation tool, HumANav.
We demonstrate that the learned navigation policies can anticipate and react to humans without explicitly predicting future human motion.
arXiv Detail & Related papers (2020-03-20T16:13:47Z)
- 3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans [27.747241700017728]
We present a unified representation for actionable spatial perception: 3D Dynamic Scene Graphs.
3D Dynamic Scene Graphs can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction.
arXiv Detail & Related papers (2020-02-15T00:46:32Z)