LoCUS: Learning Multiscale 3D-consistent Features from Posed Images
- URL: http://arxiv.org/abs/2310.01095v1
- Date: Mon, 2 Oct 2023 11:11:23 GMT
- Title: LoCUS: Learning Multiscale 3D-consistent Features from Posed Images
- Authors: Dominik A. Kloepfer, Dylan Campbell, João F. Henriques
- Abstract summary: We train a versatile neural representation without supervision.
We find that it is possible to balance retrieval and reusability by constructing a retrieval set carefully.
We show results creating sparse, multi-scale, semantic spatial maps.
- Score: 18.648772607057175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An important challenge for autonomous agents such as robots is to maintain a spatially and temporally consistent model of the world. It must be maintained through occlusions, previously-unseen views, and long time horizons (e.g., loop closure and re-identification). It is still an open question how to train such a versatile neural representation without supervision. We start from the idea that the training objective can be framed as a patch retrieval problem: given an image patch in one view of a scene, we would like to retrieve (with high precision and recall) all patches in other views that map to the same real-world location. One drawback is that this objective does not promote reusability of features: by being unique to a scene (achieving perfect precision/recall), a representation will not be useful in the context of other scenes. We find that it is possible to balance retrieval and reusability by constructing the retrieval set carefully, leaving out patches that map to far-away locations. Similarly, we can easily regulate the scale of the learned features (e.g., points, objects, or rooms) by adjusting the spatial tolerance for considering a retrieval to be positive. We optimize for (smooth) Average Precision (AP) in a single, unified ranking-based objective. This objective also doubles as a criterion for choosing landmarks or keypoints: patches with high AP make good landmarks. We show results creating sparse, multi-scale, semantic spatial maps composed of highly identifiable landmarks, with applications in landmark retrieval, localization, semantic segmentation and instance segmentation.
Related papers
- Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection [24.00828999360765]
This paper addresses the challenge of robotic grasping of general objects.
The proposed model first proposes a number of likely grasp points in the scene.
Around each grasp point, a module infers whether each voxel in its neighborhood is void or occupied by an object.
The model further estimates 6-DoF grasp poses utilizing the local occupancy-enhanced object shape information.
arXiv Detail & Related papers (2024-07-22T16:22:28Z)
- Improved Scene Landmark Detection for Camera Localization [11.56648898250606]
A method based on scene landmark detection (SLD) was recently proposed to address these limitations.
It involves training a convolutional neural network (CNN) to detect a few predetermined, salient, scene-specific 3D points or landmarks.
We show that the accuracy gap was due to insufficient model capacity and noisy labels during training.
arXiv Detail & Related papers (2024-01-31T18:59:12Z)
- PoseMatcher: One-shot 6D Object Pose Estimation by Deep Feature Matching [51.142988196855484]
We propose PoseMatcher, an accurate, model-free, one-shot object pose estimator.
We create a new training pipeline for object-to-image matching based on a three-view system.
To enable PoseMatcher to attend to distinct input modalities, an image and a point cloud, we introduce the IO-Layer.
arXiv Detail & Related papers (2023-04-03T21:14:59Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Map-free Visual Relocalization: Metric Pose Relative to a Single Image [21.28513803531557]
We propose Map-free Relocalization, which uses only one photo of a scene to enable instant, metric-scaled relocalization.
Existing datasets are not suitable to benchmark map-free relocalization, due to their focus on large scenes or their limited variability.
We have constructed a new dataset of 655 small places of interest, such as sculptures, murals and fountains, collected worldwide.
arXiv Detail & Related papers (2022-10-11T14:49:49Z)
- Sparse Semantic Map-Based Monocular Localization in Traffic Scenes Using Learned 2D-3D Point-Line Correspondences [29.419138863851526]
Given a query image, the goal is to estimate the camera pose corresponding to the prior map.
Existing approaches rely heavily on dense point descriptors at the feature level to solve the registration problem.
We propose a sparse semantic map-based monocular localization method, which solves 2D-3D registration via a well-designed deep neural network.
arXiv Detail & Related papers (2022-10-10T10:29:07Z)
- VS-Net: Voting with Segmentation for Visual Localization [72.8165619061249]
We propose a novel visual localization framework that establishes 2D-to-3D correspondences between the query image and the 3D map with a series of learnable scene-specific landmarks.
Our proposed VS-Net is extensively tested on multiple public benchmarks and can outperform state-of-the-art visual localization methods.
arXiv Detail & Related papers (2021-05-23T08:44:11Z)
- Scale Normalized Image Pyramids with AutoFocus for Object Detection [75.71320993452372]
A scale-normalized image pyramid (SNIP) is generated that, like human vision, attends only to objects within a fixed size range at each scale.
We propose an efficient spatial sub-sampling scheme which only operates on fixed-size sub-regions likely to contain objects.
The resulting algorithm is referred to as AutoFocus and results in a 2.5-5 times speed-up during inference when used with SNIP.
arXiv Detail & Related papers (2021-02-10T18:57:53Z)
- Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation [85.96410825961966]
We argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries.
To facilitate inference, we propose to instead perform regression from a set of points placed at more advantageous positions.
We apply this proposed framework, called Point-Set Anchors, to object detection, instance segmentation, and human pose estimation.
arXiv Detail & Related papers (2020-07-06T15:59:56Z)
- Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.