Progressive Query Refinement Framework for Bird's-Eye-View Semantic Segmentation from Surrounding Images
- URL: http://arxiv.org/abs/2407.17003v1
- Date: Wed, 24 Jul 2024 05:00:31 GMT
- Title: Progressive Query Refinement Framework for Bird's-Eye-View Semantic Segmentation from Surrounding Images
- Authors: Dooseop Choi, Jungyu Kang, Taeghyun An, Kyounghwan Ahn, KyoungWook Min
- Abstract summary: We introduce the Multi-Resolution (MR) concept into Bird's-Eye-View (BEV) semantic segmentation for autonomous driving.
We propose a visual feature interaction network that promotes interactions between features across images and across feature levels.
We evaluate our model on a large-scale real-world dataset.
- Score: 3.495246564946556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expressing images with Multi-Resolution (MR) features has been widely adopted in many computer vision tasks. In this paper, we introduce the MR concept into Bird's-Eye-View (BEV) semantic segmentation for autonomous driving. This introduction enhances our model's ability to capture both global and local characteristics of driving scenes through our proposed residual learning. Specifically, given a set of MR BEV query maps, the lowest resolution query map is initially updated using a View Transformation (VT) encoder. This updated query map is then upscaled and merged with a higher resolution query map to undergo further updates in a subsequent VT encoder. This process is repeated until the resolution of the updated query map reaches the target. Finally, the updated lowest resolution map is added to the target resolution map to generate the final query map. During training, we enforce both the lowest and final query maps to align with the ground-truth BEV semantic map to help our model effectively capture the global and local characteristics. We also propose a visual feature interaction network that promotes interactions between features across images and across feature levels, contributing substantially to the performance improvement. We evaluate our model on a large-scale real-world dataset. The experimental results show that our model outperforms the SOTA models in terms of the IoU metric. Code is available at https://github.com/d1024choi/ProgressiveQueryRefineNet
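To make the coarse-to-fine procedure concrete, below is a minimal PyTorch sketch of the refinement loop the abstract describes. The `VTEncoder` stand-in, the level count, and all tensor shapes are illustrative assumptions, not the authors' implementation; see their repository for the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VTEncoder(nn.Module):
    """Stand-in for a View Transformation (VT) encoder. The real encoder
    attends to multi-camera image features; a conv block suffices here."""
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, query):
        return query + self.block(query)  # residual update of the query map

class ProgressiveRefinement(nn.Module):
    """Coarse-to-fine refinement over multi-resolution BEV query maps."""
    def __init__(self, dim, num_levels=3):
        super().__init__()
        self.encoders = nn.ModuleList(VTEncoder(dim) for _ in range(num_levels))

    def forward(self, queries):
        # queries: list of (B, C, H, W) query maps, lowest resolution first
        q = self.encoders[0](queries[0])
        lowest = q  # supervised against the GT BEV semantic map during training
        for enc, q_next in zip(self.encoders[1:], queries[1:]):
            q = F.interpolate(q, size=q_next.shape[-2:], mode="bilinear",
                              align_corners=False)
            q = enc(q + q_next)  # merge with the higher-resolution map, refine
        final = q + F.interpolate(lowest, size=q.shape[-2:], mode="bilinear",
                                  align_corners=False)  # residual from coarsest
        return lowest, final  # both aligned with the GT map during training

queries = [torch.randn(1, 64, s, s) for s in (25, 50, 100)]
low_res_map, final_map = ProgressiveRefinement(dim=64)(queries)
```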
Related papers
- VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization [108.68014173017583]
Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics for the environmental elements around the ego car.
We propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space.
Thanks to the obtained BEV tokens, accompanied by a codebook embedding encapsulating the semantics of the different BEV elements in the ground-truth maps, we are able to directly align the sparse backbone image features with the obtained BEV tokens.
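The tokenization step VQ-Map builds on can be pictured as a nearest-neighbor lookup into a learned codebook, in the VQ-VAE style. A toy sketch, with the codebook size and feature dimension chosen arbitrarily:

```python
import torch

def quantize(features, codebook):
    """Nearest-neighbor lookup into a learned codebook (VQ-VAE style).
    features: (N, D) continuous BEV features; codebook: (K, D) embeddings.
    Returns discrete token ids (N,) and the quantized vectors (N, D)."""
    dists = torch.cdist(features, codebook)  # (N, K) pairwise distances
    ids = dists.argmin(dim=1)                # discrete BEV tokens
    return ids, codebook[ids]

codebook = torch.randn(512, 64)   # K=512 tokens, D=64 (illustrative sizes)
feats = torch.randn(100, 64)      # e.g., flattened BEV grid features
tokens, quantized = quantize(feats, codebook)
```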
arXiv Detail & Related papers (2024-11-03T16:09:47Z)
- SemVecNet: Generalizable Vector Map Generation for Arbitrary Sensor Configurations [3.8472678261304587]
We propose a modular pipeline for vector map generation with improved generalization to sensor configurations.
By adopting a BEV semantic map robust to different sensor configurations, our proposed approach significantly improves the generalization performance.
arXiv Detail & Related papers (2024-04-30T23:45:16Z)
- Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged as a task aiming at high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
Motivated by this, we model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
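The matching stage lends itself to a compact illustration: correlate the query point's feature with every frame's feature map and take the per-frame argmax as the candidate track. A toy version with made-up feature sizes; the real model adds learned features and the refinement stage on top:

```python
import torch

def match_per_frame(query_feat, frame_feats):
    """query_feat: (D,) feature of the queried point.
    frame_feats: (T, D, H, W) feature maps of all other frames.
    Returns (T, 2) integer (y, x) candidate locations, one per frame."""
    T, D, H, W = frame_feats.shape
    corr = torch.einsum("d,tdhw->thw", query_feat, frame_feats)  # correlations
    flat_idx = corr.flatten(1).argmax(dim=1)                     # best per frame
    y = torch.div(flat_idx, W, rounding_mode="floor")
    x = flat_idx % W
    return torch.stack((y, x), dim=1)

locs = match_per_frame(torch.randn(64), torch.randn(8, 64, 32, 32))
```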
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
- NeMO: Neural Map Growing System for Spatiotemporal Fusion in Bird's-Eye-View and BDD-Map Benchmark [9.430779563669908]
Vision-centric Bird's-Eye View representation is essential for autonomous driving systems.
This work outlines a new paradigm, named NeMO, for generating local maps using a readable and writable big map.
With an assumption that the feature distribution of all BEV grids follows an identical pattern, we adopt a shared-weight neural network for all grids to update the big map.
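Under the shared-weight assumption stated above, the big-map update can be sketched as a single recurrent cell applied to every BEV grid independently. The GRU choice and the dimensions below are illustrative guesses, not NeMO's actual design:

```python
import torch
import torch.nn as nn

class GridMapUpdater(nn.Module):
    """One shared GRU cell updates every grid cell of a persistent map."""
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)  # identical weights for all grids

    def forward(self, big_map, local_obs):
        # big_map, local_obs: (H, W, D) stored map and new BEV observation
        H, W, D = big_map.shape
        updated = self.cell(local_obs.reshape(-1, D), big_map.reshape(-1, D))
        return updated.reshape(H, W, D)

big_map = torch.zeros(50, 50, 32)            # the readable/writable big map
obs = torch.randn(50, 50, 32)                # one frame's BEV features
big_map = GridMapUpdater(32)(big_map, obs)   # read-modify-write update
```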
arXiv Detail & Related papers (2023-06-07T15:46:15Z)
- Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
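For context, the "many-to-one" impediment comes from the spherical projection itself: several 3D points can land in the same range-image pixel. A minimal NumPy sketch of such a projection; the sensor field-of-view values are assumed:

```python
import numpy as np

def to_range_image(points, H=64, W=2048):
    """Project LiDAR points (N, 3) to a spherical range image (H, W).
    'Many-to-one': several points can fall in the same pixel; here the
    nearest one wins, discarding the rest."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                       # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-8))   # elevation
    u = ((yaw / np.pi + 1.0) * 0.5 * W).astype(int) % W
    fov_up, fov_down = np.radians(3.0), np.radians(-25.0)  # assumed FoV
    v = ((fov_up - pitch) / (fov_up - fov_down) * H).clip(0, H - 1).astype(int)
    img = np.full((H, W), -1.0)
    order = np.argsort(-r)   # write far points first so near ones overwrite
    img[v[order], u[order]] = r[order]
    return img

rng_img = to_range_image(np.random.randn(1000, 3) * 10)
```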
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
- BEV-Locator: An End-to-end Visual Semantic Localization Network Using Multi-View Images [13.258689143949912]
We propose an end-to-end visual semantic localization neural network using multi-view camera images.
BEV-Locator is capable of estimating vehicle poses under versatile scenarios.
Experiments report satisfactory accuracy, with mean absolute errors of 0.052 m, 0.135 m, and 0.251° in lateral translation, longitudinal translation, and heading angle, respectively.
arXiv Detail & Related papers (2022-11-27T20:24:56Z)
- Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, making it efficient and applicable to real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z)
- LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation [43.12994451281451]
We present 'LaRa', an efficient encoder-decoder, transformer-based model for vehicle semantic segmentation from multiple cameras.
Our approach uses a system of cross-attention to aggregate information over multiple sensors into a compact, yet rich, collection of latent representations.
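The cross-attention aggregation can be pictured as a fixed set of learned latent vectors querying the tokens from all cameras at once, compressing them into a fixed-size representation. All sizes below are illustrative, not LaRa's actual configuration:

```python
import torch
import torch.nn as nn

class LatentAggregator(nn.Module):
    """Learned latents cross-attend to tokens from all cameras."""
    def __init__(self, num_latents=128, dim=256, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cam_tokens):
        # cam_tokens: (B, N, dim), N = tokens pooled over all cameras
        q = self.latents.expand(cam_tokens.size(0), -1, -1)
        out, _ = self.attn(q, cam_tokens, cam_tokens)  # latents as queries
        return out  # (B, num_latents, dim) compact scene representation

tokens = torch.randn(2, 6 * 300, 256)   # e.g., 6 cameras x 300 tokens each
latents = LatentAggregator()(tokens)
```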
arXiv Detail & Related papers (2022-06-27T13:37:50Z)
- TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo [55.30992853477754]
We present TANDEM, a real-time monocular tracking and dense mapping framework.
For pose estimation, TANDEM performs photometric bundle adjustment based on a sliding window of keyframes.
TANDEM shows state-of-the-art real-time 3D reconstruction performance.
arXiv Detail & Related papers (2021-11-14T19:01:02Z)
- Diff-Net: Image Feature Difference based High-Definition Map Change Detection [13.666189678747996]
Up-to-date High-Definition (HD) maps are essential for self-driving cars.
We present a deep neural network (DNN), Diff-Net, to detect changes in these maps.
Results demonstrate that our Diff-Net achieves better performance than the baseline methods and is ready to be integrated into a map production pipeline that maintains an up-to-date HD map.
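The feature-difference idea can be sketched as encoding the live camera image and the map rendered into the same view with a shared encoder, then scoring the difference. A toy stand-in, not Diff-Net's architecture:

```python
import torch
import torch.nn as nn

class DiffHead(nn.Module):
    """Toy change detector: compare features of the live camera image
    with features of the HD map rendered into the same view."""
    def __init__(self, dim=64):
        super().__init__()
        self.enc = nn.Conv2d(3, dim, 3, padding=1)   # shared encoder (toy)
        self.head = nn.Conv2d(dim, 1, 1)             # per-pixel change score

    def forward(self, camera_img, rendered_map):
        diff = self.enc(camera_img) - self.enc(rendered_map)  # feature diff
        return torch.sigmoid(self.head(diff))                 # change map

scores = DiffHead()(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
```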
arXiv Detail & Related papers (2021-07-14T22:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.