Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks
- URL: http://arxiv.org/abs/2003.13402v1
- Date: Mon, 30 Mar 2020 12:39:44 GMT
- Title: Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks
- Authors: Thomas Roddick, Roberto Cipolla
- Abstract summary: We present a simple, unified approach for estimating maps directly from monocular images using a single end-to-end deep learning architecture.
We demonstrate the effectiveness of our approach by evaluating against several challenging baselines on the NuScenes and Argoverse datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous vehicles commonly rely on highly detailed bird's-eye-view maps of
their environment, which capture both static elements of the scene such as road
layout as well as dynamic elements such as other cars and pedestrians.
Generating these map representations on the fly is a complex multi-stage
process which incorporates many important vision-based elements, including
ground plane estimation, road segmentation and 3D object detection. In this
work we present a simple, unified approach for estimating maps directly from
monocular images using a single end-to-end deep learning architecture. For the
maps themselves we adopt a semantic Bayesian occupancy grid framework, allowing
us to trivially accumulate information over multiple cameras and timesteps. We
demonstrate the effectiveness of our approach by evaluating against several
challenging baselines on the NuScenes and Argoverse datasets, and show that we
are able to achieve a relative improvement of 9.1% and 22.3% respectively
compared to the best-performing existing method.
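To make the accumulation step concrete, here is a minimal sketch of Bayesian occupancy grid fusion in log-odds space. It assumes each camera or timestep yields a per-class occupancy probability map already warped into a common bird's-eye-view frame; the shapes and function names are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def logit(p, eps=1e-6):
    """Probability -> log-odds, clipped for numerical stability."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(x):
    """Log-odds -> probability."""
    return 1.0 / (1.0 + np.exp(-x))

def fuse_occupancy(prob_maps, prior=0.5):
    """Fuse per-observation semantic occupancy probabilities.

    prob_maps: list of (C, H, W) arrays, one per camera/timestep, giving
    the probability that each bird's-eye-view cell contains each of C
    semantic classes. Treating observations as independent, a Bayesian
    occupancy grid accumulates evidence by summing log-odds relative to
    the prior.
    """
    acc = np.full_like(prob_maps[0], logit(prior))
    for p in prob_maps:
        acc += logit(p) - logit(prior)
    return sigmoid(acc)

# Example: three noisy observations of a 2-class, 4x4 grid.
observations = [np.random.uniform(0.3, 0.7, size=(2, 4, 4)) for _ in range(3)]
fused = fuse_occupancy(observations)
print(fused.shape)  # (2, 4, 4)
```

Because the fusion is a running sum, new observations can be folded in online, which is what makes accumulating evidence over multiple cameras and timesteps straightforward.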
Related papers
- Neural Semantic Map-Learning for Autonomous Vehicles [85.8425492858912]
We present a mapping system that fuses local submaps gathered from a fleet of vehicles at a central instance to produce a coherent map of the road environment.
Our method jointly aligns and merges the noisy and incomplete local submaps using a scene-specific Neural Signed Distance Field.
We leverage memory-efficient sparse feature-grids to scale to large areas and introduce a confidence score to model uncertainty in scene reconstruction.
arXiv Detail & Related papers (2024-10-10T10:10:03Z)
- Visual Localization in 3D Maps: Comparing Point Cloud, Mesh, and NeRF Representations [8.522160106746478]
We present a global visual localization system capable of localizing a single camera image across various 3D map representations.
Our system generates a database by synthesizing novel views of the scene, creating RGB and depth image pairs.
NeRF synthesized images show superior performance, localizing query images at an average success rate of 72%.
arXiv Detail & Related papers (2024-08-21T19:37:17Z)
- Pixel to Elevation: Learning to Predict Elevation Maps at Long Range using Images for Autonomous Offroad Navigation [10.898724668444125]
We present a learning-based approach capable of predicting terrain elevation maps at long-range using only onboard egocentric images in real-time.
We experimentally validate the applicability of our proposed approach for autonomous offroad robotic navigation in complex and unstructured terrain.
arXiv Detail & Related papers (2024-01-30T22:37:24Z)
- Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z)
- Sparse Semantic Map-Based Monocular Localization in Traffic Scenes Using Learned 2D-3D Point-Line Correspondences [29.419138863851526]
Given a query image, the goal is to estimate the camera pose with respect to the prior map.
Existing approaches rely heavily on dense point descriptors at the feature level to solve the registration problem.
We propose a sparse semantic map-based monocular localization method, which solves 2D-3D registration via a well-designed deep neural network.
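With correspondences available, this kind of 2D-3D registration reduces to the classical Perspective-n-Point problem. Below is a minimal OpenCV sketch of that classical formulation; the correspondences and intrinsics are placeholder values, and the paper itself obtains the matches from a learned network rather than hand-crafted descriptors.

```python
import cv2
import numpy as np

# Hypothetical 2D-3D correspondences: map landmarks (metres) and their
# detections in the query image (pixels), generated here so that the
# true camera pose is the identity.
points_3d = np.array([[0.0, 0.0, 5.0],
                      [1.0, 0.0, 5.0],
                      [1.0, 1.0, 6.0],
                      [0.0, 1.0, 6.0],
                      [0.5, 0.5, 7.0],
                      [2.0, 1.0, 8.0]])
points_2d = np.array([[320.0, 240.0],
                      [480.0, 240.0],
                      [453.3, 373.3],
                      [320.0, 373.3],
                      [377.1, 297.1],
                      [520.0, 340.0]])

# Assumed pinhole intrinsics (fx = fy = 800, principal point (320, 240)),
# no lens distortion.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# RANSAC-robust PnP recovers the camera pose relative to the map frame.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(points_3d, points_2d, K, None)
R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
print("camera translation:", tvec.ravel())  # ~ [0, 0, 0] here
```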
arXiv Detail & Related papers (2022-10-10T10:29:07Z)
- Satellite Image Based Cross-view Localization for Autonomous Vehicle [59.72040418584396]
This paper shows that by using an off-the-shelf high-definition satellite image as a ready-to-use map, we are able to achieve cross-view vehicle localization with satisfactory accuracy.
Our method is validated on KITTI and Ford Multi-AV Seasonal datasets as ground view and Google Maps as the satellite view.
arXiv Detail & Related papers (2022-07-27T13:16:39Z)
- Vision-based Large-scale 3D Semantic Mapping for Autonomous Driving Applications [53.553924052102126]
We present a complete pipeline for 3D semantic mapping solely based on a stereo camera system.
The pipeline comprises a direct visual odometry front-end as well as a back-end for global temporal integration.
We propose a simple but effective voting scheme which improves the quality and consistency of the 3D point labels.
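The voting scheme can be pictured as per-point majority voting over the semantic labels observed across frames; the sketch below is an illustrative reconstruction under that assumption, not necessarily the paper's exact scheme.

```python
from collections import Counter

def vote_point_labels(observations):
    """Fuse per-frame semantic labels into one label per 3D point.

    observations: dict mapping point_id -> list of class labels, one per
    frame in which the point was observed. Majority voting suppresses
    per-frame segmentation noise and yields temporally consistent labels.
    """
    return {pid: Counter(labels).most_common(1)[0][0]
            for pid, labels in observations.items()}

# Three map points seen in several frames, with occasional mislabels.
obs = {0: ["road", "road", "sidewalk", "road"],
       1: ["car", "car", "car"],
       2: ["building", "vegetation", "building"]}
print(vote_point_labels(obs))  # {0: 'road', 1: 'car', 2: 'building'}
```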
arXiv Detail & Related papers (2022-03-02T13:18:38Z)
- TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo [55.30992853477754]
We present TANDEM, a real-time monocular tracking and dense mapping framework.
For pose estimation, TANDEM performs photometric bundle adjustment based on a sliding window of keyframes.
TANDEM shows state-of-the-art real-time 3D reconstruction performance.
arXiv Detail & Related papers (2021-11-14T19:01:02Z)
- Semantic Image Alignment for Vehicle Localization [111.59616433224662]
We present a novel approach to vehicle localization in dense semantic maps using semantic segmentation from a monocular camera.
In contrast to existing visual localization approaches, the system does not require additional keypoint features, handcrafted localization landmark extractors or expensive LiDAR sensors.
arXiv Detail & Related papers (2021-10-08T14:40:15Z)
- Crowdsourced 3D Mapping: A Combined Multi-View Geometry and Self-Supervised Learning Approach [10.610403488989428]
We propose a framework that estimates the 3D positions of semantically meaningful landmarks without assuming known camera intrinsics.
We utilize multi-view geometry as well as deep learning based self-calibration, depth, and ego-motion estimation for traffic sign positioning.
We achieve an average single-journey relative and absolute positioning accuracy of 39 cm and 1.26 m respectively.
arXiv Detail & Related papers (2020-07-25T12:10:16Z)
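The traffic-sign positioning in this last entry ultimately rests on multi-view triangulation. The following is a minimal two-view sketch with assumed intrinsics and poses; the paper itself additionally estimates calibration, depth, and ego-motion with self-supervised learning.

```python
import cv2
import numpy as np

# Assumed pinhole intrinsics shared by both views.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# View 1 at the origin; view 2 translated 1 m along the x-axis, giving
# a 1 m baseline between the two camera centres.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

# Pixel observations of the same traffic sign in the two views.
x1 = np.array([[480.0], [240.0]])  # sign at (1, 0, 5) seen from view 1
x2 = np.array([[320.0], [240.0]])  # and from view 2

X_h = cv2.triangulatePoints(P1, P2, x1, x2)  # 4x1 homogeneous point
X = (X_h[:3] / X_h[3]).ravel()               # Euclidean 3D position
print("triangulated sign position:", X)      # ~ [1, 0, 5]
```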
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.