Self-localization on a 3D map by fusing global and local features from a monocular camera
- URL: http://arxiv.org/abs/2510.26170v1
- Date: Thu, 30 Oct 2025 06:14:22 GMT
- Title: Self-localization on a 3D map by fusing global and local features from a monocular camera
- Authors: Satoshi Kikuch, Masaya Kato, Tsuyoshi Tasaki
- Abstract summary: Self-localization based on a camera often uses a convolutional neural network (CNN), which extracts local features computed from nearby pixels. This study proposes a new method that combines a CNN with a Vision Transformer, which excels at extracting global features that capture the relationships among patches across the whole image.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-localization on a 3D map using an inexpensive monocular camera is required to realize autonomous driving. Camera-based self-localization often uses a convolutional neural network (CNN), which extracts local features computed from nearby pixels. However, when dynamic obstacles, such as people, are present, a CNN does not work well. This study proposes a new method that combines a CNN with a Vision Transformer, which excels at extracting global features that capture the relationships among patches across the whole image. Experimental results showed that, compared with the state-of-the-art method (SOTA), the accuracy improvement rate on a CG dataset with dynamic obstacles is 1.5 times higher than that without dynamic obstacles. Moreover, the self-localization error of our method is 20.1% smaller than that of SOTA on public datasets. Additionally, our robot using our method can localize itself with 7.51 cm error on average, which is more accurate than SOTA.
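The core idea of the abstract, fusing CNN-style local features (computed from nearby pixels) with Transformer-style global features (computed by letting image patches attend to each other), can be illustrated with a minimal numpy sketch. This is a toy illustration, not the authors' implementation: the neighborhood mean stands in for a learned convolution, and the single softmax-attention step stands in for a full Vision Transformer.

```python
import numpy as np

def local_features(img, k=3):
    """Toy 'CNN-style' local features: mean over a k x k neighborhood
    (a stand-in for a learned convolution)."""
    h, w = img.shape
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def global_features(img, patch=4):
    """Toy 'ViT-style' global features: split the image into patches,
    then let every patch attend to every other patch via softmax
    similarity, so each output token sees the whole image."""
    h, w = img.shape
    patches = img.reshape(h // patch, patch, w // patch, patch)
    tokens = patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)  # (N, d)
    sim = tokens @ tokens.T / np.sqrt(tokens.shape[1])                 # (N, N)
    attn = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                            # softmax rows
    return attn @ tokens                                               # (N, d)

img = np.arange(64, dtype=float).reshape(8, 8)
loc = local_features(img)      # (8, 8): per-pixel local context
glo = global_features(img)     # (4, 16): 4 patch tokens with global context
fused = np.concatenate([loc.reshape(-1), glo.reshape(-1)])  # joint descriptor
```

In a real system both branches would be learned networks and the fused descriptor would feed a pose-regression or map-matching head; the concatenation here only shows where the two feature streams meet.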
Related papers
- Real World Robotic Exploration using Deep Neural Networks Trained in Photorealistic Reconstructed Environments [1.3053649021965599]
An existing deep neural network approach for determining a robot's pose from visual information (RGB images) is modified. Photogrammetry data are used to produce a pose-labelled dataset, which allows the above model to be trained on a local environment. This trained model forms the basis of a navigation algorithm, which is tested in real time on a TurtleBot.
arXiv Detail & Related papers (2025-09-12T00:03:04Z) - PoseINN: Realtime Visual-based Pose Regression and Localization with Invertible Neural Networks [3.031375888004876]
Estimating ego-pose from cameras is an important problem in robotics with applications ranging from mobile robotics to augmented reality.
We propose to solve the problem by using invertible neural networks (INN) to find the mapping between the latent space of images and poses for a given scene.
Our model achieves similar performance to the SOTA while being faster to train and only requiring offline rendering of low-resolution synthetic data.
arXiv Detail & Related papers (2024-04-20T06:25:32Z) - UnLoc: A Universal Localization Method for Autonomous Vehicles using LiDAR, Radar and/or Camera Input [51.150605800173366]
UnLoc is a novel unified neural modeling approach for localization with multi-sensor input in all weather conditions.
Our method is extensively evaluated on Oxford Radar RobotCar, ApolloSouthBay and Perth-WA datasets.
arXiv Detail & Related papers (2023-07-03T04:10:55Z) - Neural Scene Representation for Locomotion on Structured Terrain [56.48607865960868]
We propose a learning-based method to reconstruct the local terrain for a mobile robot traversing urban environments.
Using a stream of depth measurements from the onboard cameras and the robot's trajectory, the method estimates the topography in the robot's vicinity.
We propose a 3D reconstruction model that faithfully reconstructs the scene, despite the noisy measurements and large amounts of missing data coming from the blind spots of the camera arrangement.
arXiv Detail & Related papers (2022-06-16T10:45:17Z) - TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization [81.70547404891099]
CNN-based methods for cross-view image geo-localization fail to model global correlation.
We propose a pure transformer-based approach (TransGeo) to address these limitations.
TransGeo achieves state-of-the-art results on both urban and rural datasets.
arXiv Detail & Related papers (2022-03-31T21:19:41Z) - Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [68.55487598401788]
Recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers.
We propose a novel Unified transFormer (UniFormer) that seamlessly integrates the merits of 3D convolution and self-attention in a concise transformer format.
We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
Our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods.
arXiv Detail & Related papers (2022-01-12T20:02:32Z) - Monocular Camera Localization for Automated Vehicles Using Image Retrieval [8.594652891734288]
We address the problem of finding the current position and heading angle of an autonomous vehicle in real-time using a single camera.
Compared to methods which require LiDARs and high definition (HD) 3D maps in real-time, the proposed approach is easily scalable and computationally efficient.
arXiv Detail & Related papers (2021-09-13T20:12:42Z) - CoordiNet: uncertainty-aware pose regressor for reliable vehicle localization [3.4386226615580107]
We investigate vision-based camera localization with neural networks for robotics and autonomous vehicle applications.
Our solution is a CNN-based algorithm which predicts camera pose directly from a single image.
We show that our proposal is a reliable alternative, achieving a 29 cm median error on a 1.9 km loop in a busy urban area.
arXiv Detail & Related papers (2021-03-19T13:32:40Z) - SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations.
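The pairwise self-attention mechanism mentioned above can be sketched in a few lines of numpy. This is a generic scaled dot-product self-attention over a set of detector features, not the SA-Det3D implementation: projections are omitted for brevity, and `voxel_feats` is a hypothetical random stand-in for real BEV/voxel/point features.

```python
import numpy as np

def pairwise_self_attention(feats):
    """Pairwise self-attention over N detector features of shape (N, d):
    each feature is re-expressed as a similarity-weighted sum of all
    features, injecting global context into otherwise local detections."""
    d = feats.shape[1]
    scores = feats @ feats.T / np.sqrt(d)          # (N, N) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over each row
    return attn @ feats                            # (N, d) context-aware features

rng = np.random.default_rng(0)
voxel_feats = rng.normal(size=(32, 16))  # e.g. 32 voxel/BEV features, 16-dim
ctx = pairwise_self_attention(voxel_feats)
```

The deformable variant in the paper replaces the full (N, N) attention with attention over a learned subset of sampled locations, trading exact pairwise coverage for lower cost.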
arXiv Detail & Related papers (2021-01-07T18:30:32Z) - 3D CNNs with Adaptive Temporal Feature Resolutions [83.43776851586351]
Similarity Guided Sampling (SGS) module can be plugged into any existing 3D CNN architecture.
SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.
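The grouping idea behind SGS can be illustrated with a simple greedy sketch: merge consecutive frame features whose cosine similarity is high, so near-duplicate frames are processed once instead of T times. This is a minimal stand-in assuming a similarity threshold, not the learned, differentiable module from the paper.

```python
import numpy as np

def similarity_guided_sampling(frames, threshold=0.9):
    """Greedy temporal grouping over frame features of shape (T, d):
    walk the temporal axis and merge each frame into the current group
    when its cosine similarity to the group mean exceeds `threshold`.
    Returns the (T', d) group means with T' <= T."""
    groups = [[frames[0]]]
    for f in frames[1:]:
        rep = np.mean(groups[-1], axis=0)
        cos = f @ rep / (np.linalg.norm(f) * np.linalg.norm(rep) + 1e-8)
        if cos > threshold:
            groups[-1].append(f)   # near-duplicate frame: merge
        else:
            groups.append([f])     # dissimilar frame: start a new group
    return np.stack([np.mean(g, axis=0) for g in groups])

# 6 frames, of which frames 0-2 and frames 3-5 are identical
frames = np.vstack([np.ones((3, 8)), np.full((3, 8), -1.0)])
reduced = similarity_guided_sampling(frames)  # collapses to 2 temporal groups
```

Downstream 3D convolutions then run on the reduced temporal axis, which is where the reported GFLOPs savings come from.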
arXiv Detail & Related papers (2020-11-17T14:34:05Z) - RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices [57.877112704841366]
This paper proposes RT3D, a model compression and mobile acceleration framework for 3D CNNs.
For the first time, real-time execution of 3D CNNs is achieved on off-the-shelf mobile devices.
arXiv Detail & Related papers (2020-07-20T02:05:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.