Single-Frame Point-Pixel Registration via Supervised Cross-Modal Feature Matching
- URL: http://arxiv.org/abs/2506.22784v1
- Date: Sat, 28 Jun 2025 06:57:13 GMT
- Title: Single-Frame Point-Pixel Registration via Supervised Cross-Modal Feature Matching
- Authors: Yu Han, Zhiwei Huang, Yanting Zhang, Fangjun Ding, Shen Cai, Rui Fan
- Abstract summary: We introduce a detector-free framework for direct point-pixel matching between LiDAR and camera views. Specifically, we project the LiDAR intensity map into a 2D view from the LiDAR perspective and feed it into an attention-based matching network. To further enhance matching reliability, we introduce a repeatability scoring mechanism that acts as a soft visibility prior.
- Score: 7.5461100059974315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Point-pixel registration between LiDAR point clouds and camera images is a fundamental yet challenging task in autonomous driving and robotic perception. A key difficulty lies in the modality gap between unstructured point clouds and structured images, especially under sparse single-frame LiDAR settings. Existing methods typically extract features separately from point clouds and images, then rely on hand-crafted or learned matching strategies. This separate encoding fails to bridge the modality gap effectively; more critically, these methods struggle with the sparsity and noise of single-frame LiDAR, often requiring point cloud accumulation or additional priors to improve reliability. Inspired by recent progress in detector-free matching paradigms (e.g., MatchAnything), we revisit the projection-based approach and introduce a detector-free framework for direct point-pixel matching between LiDAR and camera views. Specifically, we project the LiDAR intensity map into a 2D view from the LiDAR perspective and feed it into an attention-based detector-free matching network, enabling cross-modal correspondence estimation without relying on multi-frame accumulation. To further enhance matching reliability, we introduce a repeatability scoring mechanism that acts as a soft visibility prior. This guides the network to suppress unreliable matches in regions with low intensity variation, improving robustness under sparse input. Extensive experiments on the KITTI, nuScenes, and MIAS-LCEC-TF70 benchmarks demonstrate that our method achieves state-of-the-art performance, outperforming prior approaches on nuScenes (even those relying on accumulated point clouds) despite using only single-frame LiDAR.
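The abstract's pipeline reduces to two concrete steps: render the sweep's per-point intensity into a dense 2D view from the LiDAR's own perspective, then down-weight candidate matches in regions with little intensity variation. Below is a minimal NumPy/SciPy sketch of both steps; the spherical-projection geometry, the image resolution and FOV bounds, the local-variance proxy for the repeatability score, and the function names `lidar_intensity_image` and `repeatability_weight` are all our assumptions for illustration, not the authors' released code.

```python
import numpy as np
from scipy.ndimage import uniform_filter


def lidar_intensity_image(points, intensity, h=64, w=1024,
                          fov_up_deg=3.0, fov_down_deg=-25.0):
    """Render one LiDAR sweep as a 2D intensity map (spherical projection).

    points:    (N, 3) x, y, z in the sensor frame
    intensity: (N,)   per-point reflectance
    The FOV bounds are sensor-specific guesses (HDL-64E-like), not the paper's.
    """
    fov_up, fov_down = np.deg2rad(fov_up_deg), np.deg2rad(fov_down_deg)
    fov = fov_up - fov_down

    depth = np.linalg.norm(points, axis=1)
    keep = depth > 1e-6                       # drop degenerate returns at the origin
    pts, inten, depth = points[keep], intensity[keep], depth[keep]

    yaw = np.arctan2(pts[:, 1], pts[:, 0])    # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(pts[:, 2] / depth, -1.0, 1.0))

    # Map azimuth to columns and elevation to rows of the h x w image.
    u = np.clip(np.floor(0.5 * (1.0 - yaw / np.pi) * w), 0, w - 1).astype(int)
    v = np.clip(np.floor((1.0 - (pitch - fov_down) / fov) * h), 0, h - 1).astype(int)

    order = np.argsort(-depth)                # write far-to-near: nearest return wins
    img = np.zeros((h, w), dtype=np.float32)  # pixels with no return stay 0 (sparse)
    img[v[order], u[order]] = inten[order]
    return img


def repeatability_weight(img, ksize=5, eps=1e-3):
    """Soft visibility prior: a local-variance stand-in for the paper's
    repeatability score. Near 0 where intensity is flat (matches unreliable),
    approaching 1 where intensity varies strongly."""
    mean = uniform_filter(img, ksize)
    var = np.maximum(uniform_filter(img * img, ksize) - mean * mean, 0.0)
    return var / (var + eps)
```

Writing points far-to-near makes the nearest return win each pixel, and untouched pixels stay zero; that residual sparsity is exactly what a soft visibility prior like the one sketched here would suppress before correspondence estimation.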
Related papers
- Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras [56.39904484784127]
We propose an approach for estimating the relative pose between rolling shutter cameras using the intersections of line projections with a single scanline per image. Alternatively, scanlines can be selected within a single image, enabling single-view relative pose estimation for scanlines of rolling shutter cameras.
arXiv Detail & Related papers (2025-06-27T10:00:21Z)
- AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [58.67129770371016]
We propose AuxDet, a novel framework that reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy.
arXiv Detail & Related papers (2025-05-21T07:02:05Z)
- FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose Self-supervised Transfer (PST) and a Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
arXiv Detail & Related papers (2025-03-25T15:04:53Z)
- PAPI-Reg: Patch-to-Pixel Solution for Efficient Cross-Modal Registration between LiDAR Point Cloud and Camera Image [10.906218491083576]
Cross-modal data fusion involves the precise alignment of data from different sensors. We propose a framework that projects point clouds into several 2D representations for matching with camera images. To tackle the challenges of cross-modal differences and the limited overlap between LiDAR point clouds and images in the image matching task, we introduce a multi-scale feature extraction network.
arXiv Detail & Related papers (2025-03-19T15:04:01Z)
- EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds [10.324549723042338]
Cross-modal data registration has long been a critical task in computer vision. We propose a method that uses edge information from the original point clouds and images for cross-modal registration. We validate our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.
arXiv Detail & Related papers (2025-03-19T15:03:41Z)
- LPRnet: A self-supervised registration network for LiDAR and photogrammetric point clouds [38.42527849407057]
LiDAR and photogrammetry are active and passive remote sensing techniques for point cloud acquisition, respectively. Due to the fundamental differences in sensing mechanisms, spatial distributions, and coordinate systems, their point clouds exhibit significant discrepancies in density, precision, noise, and overlap. This paper proposes a self-supervised registration network based on a masked autoencoder, focusing on heterogeneous LiDAR and photogrammetric point clouds.
arXiv Detail & Related papers (2025-01-10T02:36:37Z)
- A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration [9.609585217048664]
We develop a consistency-aware spot-guided Transformer (CAST).
CAST incorporates a spot-guided cross-attention module to avoid interfering with irrelevant areas.
A lightweight fine matching module for both sparse keypoints and dense features can estimate the transformation accurately.
arXiv Detail & Related papers (2024-10-14T08:48:25Z)
- From One to Many: Dynamic Cross Attention Networks for LiDAR and Camera Fusion [12.792769704561024]
Existing fusion methods tend to align each 3D point to only one projected image pixel based on calibration.
We propose a Dynamic Cross Attention (DCA) module with a novel one-to-many cross-modality mapping.
The whole fusion architecture named Dynamic Cross Attention Network (DCAN) exploits multi-level image features and adapts to multiple representations of point clouds.
arXiv Detail & Related papers (2022-09-25T16:10:14Z)
- Boosting 3D Object Detection by Simulating Multimodality on Point Clouds [51.87740119160152]
This paper presents a new approach to boost a single-modality (LiDAR) 3D object detector by teaching it to simulate features and responses that follow a multi-modality (LiDAR-image) detector.
The approach needs LiDAR-image data only when training the single-modality detector, and once well-trained, it only needs LiDAR data at inference.
Experimental results on the nuScenes dataset show that our approach outperforms all SOTA LiDAR-only 3D detectors.
arXiv Detail & Related papers (2022-06-30T01:44:30Z)
- LIF-Seg: LiDAR and Camera Image Fusion for 3D LiDAR Semantic Segmentation [78.74202673902303]
We propose a coarse-to-fine LiDAR and camera fusion-based network (termed LIF-Seg) for LiDAR segmentation.
The proposed method fully utilizes the contextual information of images and introduces a simple but effective early-fusion strategy.
The cooperation of these two components leads to effective camera-LiDAR fusion.
arXiv Detail & Related papers (2021-08-17T08:53:11Z)
- Self-Supervised Multi-Frame Monocular Scene Flow [61.588808225321735]
We introduce a multi-frame monocular scene flow network based on self-supervised learning.
We observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
arXiv Detail & Related papers (2021-05-05T17:49:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.