LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization
- URL: http://arxiv.org/abs/2312.16648v1
- Date: Wed, 27 Dec 2023 17:23:57 GMT
- Title: LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization
- Authors: Sai Shubodh Puligilla, Mohammad Omama, Husain Zaidi, Udit Singh Parihar and Madhava Krishna
- Abstract summary: We apply Contrastive Language-Image Pre-Training to the domains of 2D image and 3D LiDAR points on the task of cross-modal localization.
Our method outperforms state-of-the-art recall@1 accuracy on the KITTI-360 dataset by 22.4%, using only perspective images.
We also demonstrate the zero-shot capabilities of our model, beating the SOTA by 8% without training on the evaluation dataset.
- Score: 0.9562145896371785
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Global visual localization in LiDAR maps, crucial for autonomous driving applications, remains largely unexplored due to the challenge of bridging the cross-modal heterogeneity gap. The popular multi-modal learning approach Contrastive Language-Image Pre-Training (CLIP) popularized a symmetric contrastive loss with a batch-construction technique by applying it to the multi-modal domains of text and image. We apply this approach to the domains of 2D images and 3D LiDAR points for the task of cross-modal localization. Our method works as follows: a batch of N (image, LiDAR) pairs is constructed, and an image encoder and a LiDAR encoder are jointly trained to learn a multi-modal embedding space in which the correct matches among the N × N possible pairings across the batch are predicted. In this way, the cosine similarity between the N positive pairings is maximized, while that between the remaining negative pairings is minimized. Finally, a symmetric cross-entropy loss is optimized over the resulting similarity scores. To the best of our knowledge, this is the first work to apply a batched contrastive loss to the cross-modal setting of image and LiDAR data, and the first to show zero-shot transfer in a visual localization setting. We conduct extensive analyses on the standard autonomous driving datasets KITTI and KITTI-360. Our method outperforms the state-of-the-art recall@1 accuracy on the KITTI-360 dataset by 22.4%, using only perspective images, in contrast to the state-of-the-art approach, which relies on the more informative fisheye images. This superior performance is achieved without resorting to complex architectures. Moreover, we demonstrate the zero-shot capabilities of our model, beating the SOTA by 8% without training on the evaluation dataset. Furthermore, we establish the first benchmark for cross-modal localization on the KITTI dataset.
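
To make the batched objective concrete, the following is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss over a batch of (image, LiDAR) embedding pairs. The function name, embedding dimension, and temperature value are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the CLIP-style symmetric contrastive loss described above.
# The temperature and the assumption of pre-computed (N, D) embeddings are
# illustrative, not the exact LIP-Loc implementation.
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb, lidar_emb, temperature=0.07):
    """image_emb, lidar_emb: (N, D) outputs of the image and LiDAR encoders."""
    # L2-normalize so that the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    lidar_emb = F.normalize(lidar_emb, dim=-1)

    # N x N cosine-similarity matrix over all pairings in the batch.
    logits = image_emb @ lidar_emb.t() / temperature

    # The i-th image matches the i-th LiDAR scan: targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over both directions:
    # image-to-LiDAR (rows) and LiDAR-to-image (columns).
    loss_i2l = F.cross_entropy(logits, targets)
    loss_l2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2l + loss_l2i) / 2
```

Under this formulation, localization at test time reduces to cross-modal retrieval: the query image embedding is compared against the embeddings of the LiDAR map, and the highest-cosine-similarity match gives the predicted location (as measured by recall@1).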
Related papers
- LiOn-XA: Unsupervised Domain Adaptation via LiDAR-Only Cross-Modal Adversarial Training [61.26381389532653] (arXiv, 2024-10-21T09:50:17Z)
  LiOn-XA is an unsupervised domain adaptation (UDA) approach that combines LiDAR-Only Cross-Modal (X) learning with Adversarial training for 3D LiDAR point cloud semantic segmentation. Our experiments on 3 real-to-real adaptation scenarios demonstrate the effectiveness of our approach.
- Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling [11.129453244307369] (arXiv, 2024-06-17T13:49:12Z)
  FG-SBIR aims to minimize the distance between sketches and corresponding images in the embedding space. We propose an effective approach to narrow the gap between the two domains. It mainly facilitates unified mutual information sharing both intra- and inter-samples.
- Symmetrical Bidirectional Knowledge Alignment for Zero-Shot Sketch-Based Image Retrieval [69.46139774646308] (arXiv, 2023-12-16T04:50:34Z)
  This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR). It aims to use sketches from unseen categories as queries to match images of the same category. We propose a novel Symmetrical Bidirectional Knowledge Alignment for zero-shot sketch-based image retrieval (SBKA).
- A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with Batch Normalization and Knowledge Distillation [3.364554138758565] (arXiv, 2023-05-30T12:41:04Z)
  Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query. We introduce a Relative Triplet Loss (RTL), an adapted triplet loss that overcomes limitations through loss weighting based on anchor similarity. We propose a straightforward approach to train small models efficiently with a marginal loss of accuracy through knowledge distillation.
- Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [67.94357336206136] (arXiv, 2023-01-22T08:26:58Z)
  We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects. First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation. Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
- S2-Net: Self-supervision Guided Feature Representation Learning for Cross-Modality Images [0.0] (arXiv, 2022-03-28T08:47:49Z)
  Cross-modality image pairs often fail to make the feature representations of correspondences as close as possible. In this letter, we design a cross-modality feature representation learning network, S2-Net, which is based on the recently successful detect-and-describe pipeline. We introduce self-supervised learning with a well-designed loss function to guide the training without discarding the original advantages.
- Modality-Aware Triplet Hard Mining for Zero-shot Sketch-Based Image Retrieval [51.42470171051007] (arXiv, 2021-12-15T08:36:44Z)
  This paper tackles the Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) problem from the viewpoint of cross-modality metric learning. By combining two fundamental learning approaches in DML, i.e., classification training and pairwise training, we set up a strong baseline for ZS-SBIR. We show that Modality-Aware Triplet Hard Mining (MATHM) enhances the baseline with three types of pairwise learning.
- Self-Supervised Multi-Frame Monocular Scene Flow [61.588808225321735] (arXiv, 2021-05-05T17:49:55Z)
  We introduce a multi-frame monocular scene flow network based on self-supervised learning. We observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
- StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951] (arXiv, 2021-04-14T19:58:24Z)
  We propose a novel approach for multi-modal image-to-image (I2I) translation. We learn a latent embedding, jointly with the generator, that models the variability of the output domain. Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.