Related papers: D2S: Representing sparse descriptors and 3D coordinates for camera relocalization

D2S: Representing sparse descriptors and 3D coordinates for camera relocalization

URL: http://arxiv.org/abs/2307.15250v3
Date: Fri, 12 Jul 2024 17:10:55 GMT
Title: D2S: Representing sparse descriptors and 3D coordinates for camera relocalization
Authors: Bach-Thuan Bui, Huy-Hoang Bui, Dinh-Tuan Tran, Joo-Ho Lee,
Abstract summary: We propose a learning-based approach to represent complex local descriptors and their scene coordinates. Our method is characterized by its simplicity and cost-effectiveness. Our approach outperforms the state-of-the-art CNN-based methods in scene coordinate regression in indoor and outdoor environments.
Score: 1.2974519529978974
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors and 3D point clouds. However, these procedures can incur significant costs in terms of inference, storage, and updates over time. In this study, we propose a direct learning-based approach that utilizes a simple network named D2S to represent complex local descriptors and their scene coordinates. Our method is characterized by its simplicity and cost-effectiveness. It solely leverages a single RGB image for localization during the testing phase and only requires a lightweight model to encode a complex sparse scene. The proposed D2S employs a combination of a simple loss function and graph attention to selectively focus on robust descriptors while disregarding areas such as clouds, trees, and several dynamic objects. This selective attention enables D2S to effectively perform a binary-semantic classification for sparse descriptors. Additionally, we propose a simple outdoor dataset to evaluate the capabilities of visual localization methods in scene-specific generalization and self-updating from unlabeled observations. Our approach outperforms the state-of-the-art CNN-based methods in scene coordinate regression in indoor and outdoor environments. It demonstrates the ability to generalize beyond training data, including scenarios involving transitions from day to night and adapting to domain shifts, even in the absence of the labeled data sources. The source code, trained models, dataset, and demo videos are available at the following link: https://thpjp.github.io/d2s.

Related papers

A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios [1.2093553114715083]
A-ScoRe is an Attention-based model which leverage attention on descriptor map level to produce meaningful and high-semantic 2D descriptors. Results show our methods achieve comparable performance with State-of-the-art methods on multiple benchmark while being light-weighted and much more flexible.
arXiv Detail & Related papers (2025-03-18T07:39:50Z)
NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features [50.212836834889146]
We propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields.
arXiv Detail & Related papers (2025-03-08T08:04:27Z)
A2-GNN: Angle-Annular GNN for Visual Descriptor-free Camera Relocalization [8.881372153385028]
This paper introduces the Angle-Annular Graph Neural Network (A2-GNN), a simple approach that efficiently learns robust geometric structural representations. Our approach achieves state-of-the-art accuracy with low computational overhead among visual description-free methods.
arXiv Detail & Related papers (2025-02-27T12:25:30Z)
FUSELOC: Fusing Global and Local Descriptors to Disambiguate 2D-3D Matching in Visual Localization [57.59857784298536]
Direct 2D-3D matching algorithms require significantly less memory but suffer from lower accuracy due to the larger and more ambiguous search space. We address this ambiguity by fusing local and global descriptors using a weighted average operator within a 2D-3D search framework. We consistently improve the accuracy over local-only systems and achieve performance close to hierarchical methods while halving memory requirements.
arXiv Detail & Related papers (2024-08-21T23:42:16Z)
Improved Scene Landmark Detection for Camera Localization [11.56648898250606]
Method based on scene landmark detection (SLD) was recently proposed to address these limitations. It involves training a convolutional neural network (CNN) to detect a few predetermined, salient, scene-specific 3D points or landmarks. We show that the accuracy gap was due to insufficient model capacity and noisy labels during training.
arXiv Detail & Related papers (2024-01-31T18:59:12Z)
ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames. Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z)
NeRF-Loc: Visual Localization with Conditional Neural Radiance Field [25.319374695362267]
We propose a novel visual re-localization method based on direct matching between implicit 3D descriptors and the 2D image with transformer. Experiments show that our method achieves higher localization accuracy than other learning-based approaches on multiple benchmarks.
arXiv Detail & Related papers (2023-04-17T03:53:02Z)
A Unified BEV Model for Joint Learning of 3D Local Features and Overlap Estimation [12.499361832561634]
We present a unified bird's-eye view (BEV) model for jointly learning of 3D local features and overlap estimation. Our method significantly outperforms existing methods on overlap prediction, especially in scenes with small overlaps.
arXiv Detail & Related papers (2023-02-28T12:01:16Z)
Flattening-Net: Deep Regular 2D Representation for 3D Point Cloud Analysis [66.49788145564004]
We present an unsupervised deep neural architecture called Flattening-Net to represent irregular 3D point clouds of arbitrary geometry and topology. Our methods perform favorably against the current state-of-the-art competitors.
arXiv Detail & Related papers (2022-12-17T15:05:25Z)
Fast and Lightweight Scene Regressor for Camera Relocalization [1.6708069984516967]
Estimating the camera pose directly with respect to pre-built 3D models can be prohibitively expensive for several applications. This study proposes a simple scene regression method that requires only a multi-layer perceptron network for mapping scene coordinates. The proposed approach uses sparse descriptors to regress the scene coordinates, instead of a dense RGB image.
arXiv Detail & Related papers (2022-12-04T14:41:20Z)
SE(3)-Equivariant Attention Networks for Shape Reconstruction in Function Space [50.14426188851305]
We propose the first SE(3)-equivariant coordinate-based network for learning occupancy fields from point clouds. In contrast to previous shape reconstruction methods that align the input to a regular grid, we operate directly on the irregular, unoriented point cloud. We show that our method outperforms previous SO(3)-equivariant methods, as well as non-equivariant methods trained on SO(3)-augmented datasets.
arXiv Detail & Related papers (2022-04-05T17:59:15Z)
Distinctive 3D local deep descriptors [2.512827436728378]
Point cloud patches are extracted, canonicalised with respect to their estimated local reference frame and encoded by a PointNet-based deep neural network. We evaluate and compare DIPs against alternative hand-crafted and deep descriptors on several datasets consisting of point clouds reconstructed using different sensors.
arXiv Detail & Related papers (2020-09-01T06:25:06Z)
Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem. We compute a series of partitioning-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion. A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
DH3D: Deep Hierarchical 3D Descriptors for Robust Large-Scale 6DoF Relocalization [56.15308829924527]
We propose a Siamese network that jointly learns 3D local feature detection and description directly from raw 3D points. For detecting 3D keypoints we predict the discriminativeness of the local descriptors in an unsupervised manner. Experiments on various benchmarks demonstrate that our method achieves competitive results for both global point cloud retrieval and local point cloud registration.
arXiv Detail & Related papers (2020-07-17T20:21:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.