SaliencyI2PLoc: saliency-guided image-point cloud localization using contrastive learning
- URL: http://arxiv.org/abs/2412.15577v1
- Date: Fri, 20 Dec 2024 05:20:10 GMT
- Title: SaliencyI2PLoc: saliency-guided image-point cloud localization using contrastive learning
- Authors: Yuhao Li, Jianping Li, Zhen Dong, Yuan Wang, Bisheng Yang,
- Abstract summary: SaliencyI2PLoc is a contrastive learning architecture that fuses the saliency map into feature aggregation.
Our method achieves a Recall@1 of 78.92% and a Recall@20 of 97.59% on the urban scenario evaluation dataset.
- Score: 17.29563451509921
- License:
- Abstract: Image to point cloud global localization is crucial for robot navigation in GNSS-denied environments and has become increasingly important for multi-robot map fusion and urban asset management. The modality gap between images and point clouds poses significant challenges for cross-modality fusion. Current cross-modality global localization solutions either require modality unification, which leads to information loss, or rely on engineered training schemes to encode multi-modality features, which often lack feature alignment and relation consistency. To address these limitations, we propose, SaliencyI2PLoc, a novel contrastive learning based architecture that fuses the saliency map into feature aggregation and maintains the feature relation consistency on multi-manifold spaces. To alleviate the pre-process of data mining, the contrastive learning framework is applied which efficiently achieves cross-modality feature mapping. The context saliency-guided local feature aggregation module is designed, which fully leverages the contribution of the stationary information in the scene generating a more representative global feature. Furthermore, to enhance the cross-modality feature alignment during contrastive learning, the consistency of relative relationships between samples in different manifold spaces is also taken into account. Experiments conducted on urban and highway scenario datasets demonstrate the effectiveness and robustness of our method. Specifically, our method achieves a Recall@1 of 78.92% and a Recall@20 of 97.59% on the urban scenario evaluation dataset, showing an improvement of 37.35% and 18.07%, compared to the baseline method. This demonstrates that our architecture efficiently fuses images and point clouds and represents a significant step forward in cross-modality global localization. The project page and code will be released.
Related papers
- Without Paired Labeled Data: An End-to-End Self-Supervised Paradigm for UAV-View Geo-Localization [2.733505168507872]
UAV-View Geo-Localization aims to ascertain the precise location of a UAV by retrieving the most similar GPS-tagged satellite image.
Existing methods rely on supervised learning paradigms that necessitate annotated paired data for training.
We propose the Dynamic Memory-Driven and Neighborhood Information Learning network, a lightweight end-to-end self-supervised framework for UAV-view geo-localization.
arXiv Detail & Related papers (2025-02-17T02:53:08Z) - Multi-Level Embedding and Alignment Network with Consistency and Invariance Learning for Cross-View Geo-Localization [2.733505168507872]
Cross-View Geo-Localization (CVGL) involves determining the localization of drone images by retrieving the most similar GPS-tagged satellite images.
Existing methods often overlook the problem of increased computational and storage requirements when improving model performance.
We propose a lightweight enhanced alignment network, called the Multi-Level Embedding and Alignment Network (MEAN)
arXiv Detail & Related papers (2024-12-19T13:10:38Z) - World-Consistent Data Generation for Vision-and-Language Navigation [52.08816337783936]
Vision-and-Language Navigation (VLN) is a challenging task that requires an agent to navigate through photorealistic environments following natural-language instructions.
One main obstacle existing in VLN is data scarcity, leading to poor generalization performance over unseen environments.
We propose the world-consistent data generation (WCGEN), an efficacious data-augmentation framework satisfying both diversity and world-consistency.
arXiv Detail & Related papers (2024-12-09T11:40:54Z) - Localization, balance and affinity: a stronger multifaceted collaborative salient object detector in remote sensing images [24.06927394483275]
We propose a stronger multifaceted collaborative salient object detector in ORSIs, termed LBA-MCNet.
The network focuses on accurately locating targets, balancing detailed features, and modeling image-level global context information.
arXiv Detail & Related papers (2024-10-31T14:50:48Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Multi-Temporal Relationship Inference in Urban Areas [75.86026742632528]
Finding temporal relationships among locations can benefit a bunch of urban applications, such as dynamic offline advertising and smart public transport planning.
We propose a solution to Trial with a graph learning scheme, which includes a spatially evolving graph neural network (SEENet)
SEConv performs the intra-time aggregation and inter-time propagation to capture the multifaceted spatially evolving contexts from the view of location message passing.
SE-SSL designs time-aware self-supervised learning tasks in a global-local manner with additional evolving constraint to enhance the location representation learning and further handle the relationship sparsity.
arXiv Detail & Related papers (2023-06-15T07:48:32Z) - Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative
Latent Attention [100.81495948184649]
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
Our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models.
arXiv Detail & Related papers (2022-11-21T18:22:39Z) - Global-and-Local Collaborative Learning for Co-Salient Object Detection [162.62642867056385]
The goal of co-salient object detection (CoSOD) is to discover salient objects that commonly appear in a query group containing two or more relevant images.
We propose a global-and-local collaborative learning architecture, which includes a global correspondence modeling (GCM) and a local correspondence modeling (LCM)
The proposed GLNet is evaluated on three prevailing CoSOD benchmark datasets, demonstrating that our model trained on a small dataset (about 3k images) still outperforms eleven state-of-the-art competitors trained on some large datasets (about 8k-200k images)
arXiv Detail & Related papers (2022-04-19T14:32:41Z) - Video Salient Object Detection via Adaptive Local-Global Refinement [7.723369608197167]
Video salient object detection (VSOD) is an important task in many vision applications.
We propose an adaptive local-global refinement framework for VSOD.
We show that our weighting methodology can further exploit the feature correlations, thus driving the network to learn more discriminative feature representation.
arXiv Detail & Related papers (2021-04-29T14:14:11Z) - Multiple Object Tracking with Correlation Learning [16.959379957515974]
We propose to exploit the local correlation module to model the topological relationship between targets and their surrounding environment.
Specifically, we establish dense correspondences of each spatial location and its context, and explicitly constrain the correlation volumes through self-supervised learning.
Our approach demonstrates the effectiveness of correlation learning with the superior performance and obtains state-of-the-art MOTA of 76.5% and IDF1 of 73.6% on MOT17.
arXiv Detail & Related papers (2021-04-08T06:48:02Z) - Global Context-Aware Progressive Aggregation Network for Salient Object
Detection [117.943116761278]
We propose a novel network named GCPANet to integrate low-level appearance features, high-level semantic features, and global context features.
We show that the proposed approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2020-03-02T04:26:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.