MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning
- URL: http://arxiv.org/abs/2506.15313v1
- Date: Wed, 18 Jun 2025 09:42:30 GMT
- Title: MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning
- Authors: Leonid Ivanov, Vasily Yuryev, Dmitry Yudin
- Abstract summary: In autonomous driving, high-definition (HD) maps and semantic maps in bird's-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced end-to-end model named MapFM for online vectorized HD map generation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In autonomous driving, high-definition (HD) maps and semantic maps in bird's-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced end-to-end model named MapFM for online vectorized HD map generation. We show that feature representation quality can be significantly boosted by incorporating a powerful foundation model for encoding camera images. To further enrich the model's understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. The source code is available at https://github.com/LIvanoff/MapFM.
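As a rough illustration of the multi-task design the abstract describes, here is a minimal PyTorch sketch: a stand-in image encoder (the real model uses a pretrained foundation model), a shared BEV feature, a vectorized-map head, and an auxiliary BEV segmentation head trained jointly. All module names, dimensions, and the loss weight are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MapFMSketch(nn.Module):
    """Illustrative multi-task layout: a (stand-in) foundation image
    encoder feeds a shared BEV feature, which drives both a vectorized
    map head and an auxiliary BEV semantic-segmentation head."""

    def __init__(self, feat_dim=768, bev_dim=256, num_classes=4,
                 num_points=100):
        super().__init__()
        # Stand-in for a frozen, pretrained foundation encoder; a single
        # patchify conv keeps the sketch self-contained and runnable.
        self.image_encoder = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        # Stand-in for the camera-to-BEV view transform.
        self.to_bev = nn.Conv2d(feat_dim, bev_dim, kernel_size=1)
        # Primary head: vectorized map elements as 2D point sequences.
        self.map_head = nn.Linear(bev_dim, num_points * 2)
        # Auxiliary head: dense semantic segmentation in BEV.
        self.seg_head = nn.Conv2d(bev_dim, num_classes, kernel_size=1)

    def forward(self, images):
        bev = self.to_bev(self.image_encoder(images))
        seg_logits = self.seg_head(bev)               # (B, C, H, W)
        points = self.map_head(bev.mean(dim=(2, 3)))  # (B, num_points * 2)
        return points, seg_logits

model = MapFMSketch()
points, seg = model(torch.randn(2, 3, 224, 224))
target = torch.randint(0, 4, (2, 14, 14))             # dummy BEV labels
# Multi-task objective: map regression plus weighted auxiliary segmentation.
loss = points.abs().mean() + 0.5 * F.cross_entropy(seg, target)
```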
Related papers
- DiffSemanticFusion: Semantic Raster BEV Fusion for Autonomous Driving via Online HD Map Diffusion [14.872416661028144]
We propose DiffSemanticFusion, a fusion framework for trajectory prediction and planning. Our approach reasons over a semantic-fused BEV space, enhanced by a map diffusion module. Experiments on real-world autonomous driving benchmarks, nuScenes and NAVSIM, demonstrate improved performance over several state-of-the-art methods.
arXiv Detail & Related papers (2025-08-03T14:32:05Z)
- Unified Dense Prediction of Video Diffusion [91.16237431830417]
We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We use a colormap to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation.
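The colormap trick can be made concrete with a small sketch: depth values are encoded into RGB through a fixed colormap and decoded back by nearest-neighbour lookup, so dense predictions live in the same space as generated frames. The viridis colormap and depth range here are assumptions, not the paper's choices.

```python
import numpy as np
from matplotlib import cm

def depth_to_rgb(depth, d_min=0.1, d_max=80.0):
    """Encode a depth map as an RGB image via a fixed colormap, so it
    can be generated and consumed alongside ordinary video frames."""
    norm = np.clip((depth - d_min) / (d_max - d_min), 0.0, 1.0)
    rgb = cm.viridis(norm)[..., :3]            # (H, W, 3) floats in [0, 1]
    return (rgb * 255).astype(np.uint8)

def rgb_to_depth(rgb, d_min=0.1, d_max=80.0, lut_size=256):
    """Invert the encoding by nearest-neighbour lookup into the colormap."""
    lut = (cm.viridis(np.linspace(0, 1, lut_size))[:, :3] * 255)
    lut = lut.astype(np.uint8).astype(float)    # match encoder quantization
    dists = np.linalg.norm(rgb[..., None, :].astype(float) - lut, axis=-1)
    norm = np.argmin(dists, axis=-1) / (lut_size - 1)
    return d_min + norm * (d_max - d_min)

depth = np.random.uniform(0.1, 80.0, size=(4, 4))
# Round-trip error is bounded by the colormap's 256-level quantization.
assert np.allclose(rgb_to_depth(depth_to_rgb(depth)), depth, atol=0.5)
```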
arXiv Detail & Related papers (2025-03-12T12:41:02Z)
- Leveraging V2X for Collaborative HD Maps Construction Using Scene Graph Generation [0.0]
HD maps play a crucial role in autonomous vehicle navigation, complementing onboard perception sensors for improved accuracy and safety. Traditional HD map generation relies on dedicated mapping vehicles, which are costly and fail to capture real-time infrastructure changes. This paper presents HDMapLaneNet, a novel framework leveraging V2X communication and Scene Graph Generation to collaboratively construct a localized geometric layer of HD maps.
arXiv Detail & Related papers (2025-02-14T12:56:10Z)
- TopoSD: Topology-Enhanced Lane Segment Perception with SDMap Prior [70.84644266024571]
We propose to train a perception model to "see" standard definition maps (SDMaps).
We encode SDMap elements into neural spatial map representations and instance tokens, and then incorporate such complementary features as prior information.
Based on the lane segment representation framework, the model simultaneously predicts lanes, centrelines and their topology.
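As a hedged sketch of the "SDMap elements into instance tokens" step, one plausible encoding resamples each road polyline to a fixed number of points and embeds it with a small MLP; the sampling count and network below are illustrative assumptions, not this paper's exact design.

```python
import torch
import torch.nn as nn

def resample_polyline(points, n=20):
    """Resample an (N, 2) polyline to n evenly spaced points by arc length."""
    seg = (points[1:] - points[:-1]).norm(dim=-1)
    t = torch.cat([torch.zeros(1), seg.cumsum(0)]) / seg.sum()
    ts = torch.linspace(0, 1, n)
    idx = torch.searchsorted(t, ts).clamp(1, len(points) - 1)
    w = (ts - t[idx - 1]) / (t[idx] - t[idx - 1] + 1e-9)
    return points[idx - 1] + w[:, None] * (points[idx] - points[idx - 1])

class SDMapTokenizer(nn.Module):
    """Embed each SDMap element (a road polyline) into one token that
    can be injected as prior context into a BEV perception model."""
    def __init__(self, n_points=20, dim=256):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(n_points * 2, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, polylines):             # list of (N_i, 2) tensors
        pts = torch.stack(
            [resample_polyline(p, self.n_points) for p in polylines])
        return self.mlp(pts.flatten(1))       # (num_elements, dim)

tokens = SDMapTokenizer()([torch.rand(5, 2), torch.rand(9, 2)])
```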
arXiv Detail & Related papers (2024-11-22T06:13:42Z)
- VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization [108.68014173017583]
Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics of the environmental elements around the ego car.
We propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space.
With the obtained BEV tokens, accompanied by a codebook embedding that encapsulates the semantics of different BEV elements in the ground-truth maps, we can directly align the sparse backbone image features with the BEV tokens.
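The core of a VQ-VAE-style prior is nearest-codebook quantization. A minimal sketch follows; the codebook size, feature dimension, and straight-through trick are standard for the general technique and assumptions with respect to this paper's exact code.

```python
import torch
import torch.nn as nn

class BEVCodebook(nn.Module):
    """Quantize continuous BEV features to their nearest codebook
    entries, yielding discrete tokens for high-level BEV semantics."""
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, feats):   # feats: (B, N, dim) flattened BEV features
        w = self.codebook.weight                    # (num_codes, dim)
        # Squared L2 distance from every feature to every code.
        d = (feats.pow(2).sum(-1, keepdim=True)
             - 2 * feats @ w.t()
             + w.pow(2).sum(-1))                    # (B, N, num_codes)
        tokens = d.argmin(dim=-1)                   # discrete BEV tokens
        quantized = self.codebook(tokens)           # (B, N, dim)
        # Straight-through estimator so gradients reach the encoder.
        quantized = feats + (quantized - feats).detach()
        return tokens, quantized

tokens, q = BEVCodebook()(torch.randn(2, 100, 256))
```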
arXiv Detail & Related papers (2024-11-03T16:09:47Z)
- Progressive Query Refinement Framework for Bird's-Eye-View Semantic Segmentation from Surrounding Images [3.495246564946556]
We introduce the Multi-Resolution (MR) concept into Bird's-Eye-View (BEV) semantic segmentation for autonomous driving.
We propose a visual feature interaction network that promotes interactions between features across images and across feature levels.
We evaluate our model on a large-scale real-world dataset.
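One plausible reading of "progressive query refinement", sketched under assumptions: coarse BEV queries are upsampled and fused with finer-resolution features, with a simple content-based gate standing in for the paper's full feature interaction network.

```python
import torch
import torch.nn.functional as F

def refine_queries(coarse_queries, fine_feats):
    """Upsample a coarse BEV query map and fuse it with
    higher-resolution features via a content-based gate."""
    # coarse_queries: (B, C, h, w); fine_feats: (B, C, 2h, 2w)
    up = F.interpolate(coarse_queries, scale_factor=2, mode="bilinear",
                       align_corners=False)
    # Gate each location by agreement between the two feature maps.
    gate = torch.sigmoid((up * fine_feats).sum(1, keepdim=True))
    return up + gate * fine_feats

q = torch.randn(2, 64, 25, 25)
f = torch.randn(2, 64, 50, 50)
refined = refine_queries(q, f)   # (2, 64, 50, 50)
```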
arXiv Detail & Related papers (2024-07-24T05:00:31Z)
- Map It Anywhere (MIA): Empowering Bird's Eye View Mapping using Large-scale Public Data [3.1968751101341173]
Top-down Bird's Eye View (BEV) maps are a popular representation for ground robot navigation. While recent methods have shown promise for predicting BEV maps from First-Person View (FPV) images, their generalizability is limited to small regions captured by current autonomous vehicle-based datasets. We show that a more scalable approach towards generalizable map prediction can be enabled by using two large-scale crowd-sourced mapping platforms.
arXiv Detail & Related papers (2024-07-11T17:57:22Z)
- Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged for high-precision object segmentation from high-resolution natural images.
Existing methods rely on multiple tedious encoder-decoder streams and stages to gradually complete global localization and local refinement.
We instead model DIS as a multi-view object perception problem and propose a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
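A sketch of the multi-view framing under stated assumptions: the high-resolution input is turned into local crops plus a downsampled global view, so a single shared backbone can aggregate both detail and context. The 2x2 crop layout is illustrative, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def make_views(image, grid=2):
    """Split an image into grid*grid local crops plus one downsampled
    global view, so a single shared encoder sees detail and context."""
    b, c, h, w = image.shape
    local_views = [image[:, :, i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
                   for i in range(grid) for j in range(grid)]
    global_view = F.interpolate(image, size=(h // grid, w // grid),
                                mode="bilinear", align_corners=False)
    # Batch the views so they share one backbone forward pass.
    return torch.cat([global_view] + local_views, dim=0)

views = make_views(torch.randn(1, 3, 512, 512))   # (5, 3, 256, 256)
```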
arXiv Detail & Related papers (2024-04-11T03:00:00Z)
- MV-Map: Offboard HD-Map Generation with Multi-view Consistency [29.797769409113105]
Bird's-eye-view (BEV) perception models can be useful for building high-definition maps (HD-Maps) with less human labor.
However, their results are often unreliable and demonstrate noticeable inconsistencies in the predicted HD-Maps from different viewpoints.
This paper advocates a more practical 'offboard' HD-Map generation setup that removes the computation constraints.
arXiv Detail & Related papers (2023-05-15T17:59:15Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps [81.86923212296863]
HD maps precisely define road lanes together with the rich semantics of traffic rules.
Only a small number of real-world road topologies and geometries are available, which significantly limits our ability to test the self-driving stack.
We propose HDMapGen, a hierarchical graph generation model capable of producing high-quality and diverse HD maps.
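A minimal data-structure sketch of a two-level (hierarchical) map graph of the kind the summary describes: coarse global key points, with each edge refined by local intermediate points. The field names are hypothetical, chosen only to make the hierarchy concrete.

```python
from dataclasses import dataclass, field

@dataclass
class GlobalNode:
    x: float
    y: float                       # coarse key point of the road layout

@dataclass
class Edge:
    src: int
    dst: int                       # indices into the global node list
    # Local level: intermediate points refining the edge's geometry.
    local_points: list = field(default_factory=list)

@dataclass
class HDMapGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

# A hierarchical generator would emit this structure autoregressively:
# first the global nodes, then, per edge, the local refinement points.
g = HDMapGraph(
    nodes=[GlobalNode(0.0, 0.0), GlobalNode(50.0, 0.0)],
    edges=[Edge(0, 1, local_points=[(12.5, 0.4), (25.0, 0.9), (37.5, 0.4)])],
)
```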
arXiv Detail & Related papers (2021-06-28T17:59:30Z)