Sequential Place Learning: Heuristic-Free High-Performance Long-Term
Place Recognition
- URL: http://arxiv.org/abs/2103.02074v1
- Date: Tue, 2 Mar 2021 22:57:43 GMT
- Title: Sequential Place Learning: Heuristic-Free High-Performance Long-Term
Place Recognition
- Authors: Marvin Chancán, Michael Milford
- Abstract summary: We develop a learning-based CNN+LSTM architecture, trainable via backpropagation through time, for viewpoint- and appearance-invariant place recognition.
Our model outperforms 15 classical methods while setting new state-of-the-art performance standards.
In addition, we show that SPL can be up to 70x faster to deploy than classical methods on a 729 km route.
- Score: 24.70946979449572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sequential matching using hand-crafted heuristics has been standard practice
in route-based place recognition for enhancing pairwise similarity results for
nearly a decade. However, the precision-recall performance of these algorithms
degrades dramatically when searching over short temporal window (TW) lengths,
and they demand high compute and storage costs on large robotic datasets for
autonomous navigation research. Here, influenced by biological systems that
robustly navigate spacetime scales even without vision, we develop a joint
visual and positional representation learning technique, via a sequential
process, and design a learning-based CNN+LSTM architecture, trainable via
backpropagation through time, for viewpoint- and appearance-invariant place
recognition. Our approach, Sequential Place Learning (SPL), is based on a CNN
function that visually encodes an environment from a single traversal, thus
reducing storage capacity, while an LSTM temporally fuses each visual embedding
with corresponding positional data -- obtained from any source of motion
estimation -- for direct sequential inference. Contrary to classical two-stage
pipelines, e.g., match-then-temporally-filter, our network directly eliminates
false positives while jointly learning sequence matching from a single
monocular image sequence, even when using short TWs. Hence, we demonstrate that our
model outperforms 15 classical methods while setting new state-of-the-art
performance standards on 4 challenging benchmark datasets, where one of them
can be considered solved with recall rates of 100% at 100% precision, correctly
matching all places under extreme sunlight-darkness changes. In addition, we
show that SPL can be up to 70x faster to deploy than classical methods on a 729
km route comprising 35,768 consecutive frames. Extensive experiments
demonstrate the... Baseline code available at
https://github.com/mchancan/deepseqslam
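
The abstract describes the CNN+LSTM architecture only at a high level. Below is a minimal, self-contained PyTorch sketch of the general idea: a CNN encodes each frame, the embedding is concatenated with a positional estimate, and an LSTM fuses the sequence over a short temporal window before a place classifier. All module choices, dimensions, and names are illustrative assumptions, not the authors' implementation; the official baseline code is in the DeepSeqSLAM repository linked above.

```python
import torch
import torch.nn as nn


class SPLSketch(nn.Module):
    """Illustrative CNN+LSTM place-recognition sketch (not the authors' code).

    A CNN encodes each frame of a route traversal; each visual embedding is
    concatenated with a positional estimate (e.g. from odometry) and fused
    over the temporal window by an LSTM, whose final hidden state is
    classified into one of `num_places` reference places.
    """

    def __init__(self, num_places: int, feat_dim: int = 512,
                 pos_dim: int = 2, hidden_dim: int = 512):
        super().__init__()
        # Any image backbone would do; a small conv stack keeps the sketch self-contained.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim + pos_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_places)

    def forward(self, images: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # images:    (batch, seq_len, 3, H, W)  -- one short temporal window per sample
        # positions: (batch, seq_len, pos_dim)  -- motion estimates for the same frames
        b, t = images.shape[:2]
        feats = self.cnn(images.flatten(0, 1)).view(b, t, -1)   # per-frame embeddings
        fused, _ = self.lstm(torch.cat([feats, positions], dim=-1))
        return self.classifier(fused[:, -1])                    # predict place at window end


if __name__ == "__main__":
    model = SPLSketch(num_places=100)
    imgs = torch.randn(4, 5, 3, 64, 64)   # 4 windows of 5 frames each
    pos = torch.randn(4, 5, 2)
    logits = model(imgs, pos)              # shape (4, 100)
    print(logits.shape)
```

Because the whole pipeline is a single differentiable network, it can be trained end-to-end with backpropagation through time on short temporal windows, which is the property the abstract emphasizes over two-stage match-then-filter pipelines.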
Related papers
- SONNET: Enhancing Time Delay Estimation by Leveraging Simulated Audio [17.811771707446926]
We show that learning-based methods can, even when trained on synthetic data, significantly outperform GCC-PHAT on novel real-world data.
We provide our trained model, SONNET, which is runnable in real-time and works on novel data out of the box for many real data applications.
arXiv Detail & Related papers (2024-11-20T10:23:21Z) - SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Disentangling Spatial and Temporal Learning for Efficient Image-to-Video
Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing gaps.
arXiv Detail & Related papers (2023-09-14T17:58:33Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal
Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - A Faster, Lighter and Stronger Deep Learning-Based Approach for Place
Recognition [7.9400442516053475]
We propose a faster, lighter and stronger approach that can generate models with fewer parameters and can spend less time in the inference stage.
We design RepVGG-lite as the backbone network in our architecture; it is more discriminative than other general networks in the Place Recognition task.
Our system has 14 times fewer parameters than Patch-NetVLAD, 6.8 times lower theoretical FLOPs, and runs 21 and 33 times faster in feature extraction and feature matching, respectively.
arXiv Detail & Related papers (2022-11-27T15:46:53Z) - Differentiable Point-Based Radiance Fields for Efficient View Synthesis [57.56579501055479]
We propose a differentiable rendering algorithm for efficient novel view synthesis.
Our method is up to 300x faster than NeRF in both training and inference.
For dynamic scenes, our method trains two orders of magnitude faster than STNeRF and renders at near interactive rate.
arXiv Detail & Related papers (2022-05-28T04:36:13Z) - Reinforcement Learning with Latent Flow [78.74671595139613]
Flow of Latents for Reinforcement Learning (Flare) is a network architecture for RL that explicitly encodes temporal information through latent vector differences.
We show that Flare recovers optimal performance in state-based RL without explicit access to the state velocity.
We also show that Flare achieves state-of-the-art performance on pixel-based challenging continuous control tasks within the DeepMind control benchmark suite.
arXiv Detail & Related papers (2021-01-06T03:50:50Z) - DeepSeqSLAM: A Trainable CNN+RNN for Joint Global Description and
Sequence-based Place Recognition [23.54696982881734]
We propose DeepSeqSLAM: a trainable CNN+RNN architecture for jointly learning visual and positional representations from a single image sequence of a route.
We demonstrate our approach on two large benchmark datasets, Nordland and Oxford RobotCar.
Our approach achieves over 72% AUC, compared to 27% AUC for Delta Descriptors and 2% AUC for SeqSLAM, while drastically reducing deployment time from around 1 hour to 1 minute against both.
arXiv Detail & Related papers (2020-11-17T09:14:02Z) - Exploiting the ConvLSTM: Human Action Recognition using Raw Depth
Video-Based Recurrent Neural Networks [0.0]
We propose and compare two neural networks based on the convolutional long short-term memory unit, namely ConvLSTM.
We show that the proposed models achieve competitive recognition accuracies with lower computational cost compared with state-of-the-art methods.
arXiv Detail & Related papers (2020-06-13T23:35:59Z) - SUPER: A Novel Lane Detection System [26.417172945374364]
We propose a real-time lane detection system, called Scene Understanding Physics-Enhanced Real-time (SUPER) algorithm.
We train the proposed system using heterogeneous data from Cityscapes, Vistas and Apollo, and evaluate the performance on four completely separate datasets.
Preliminary test results show promising real-time lane-detection performance compared with Mobileye.
arXiv Detail & Related papers (2020-05-14T21:40:39Z) - Real-Time High-Performance Semantic Image Segmentation of Urban Street
Scenes [98.65457534223539]
We propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes.
The proposed method achieves 73.6% and 68.0% mean Intersection over Union (mIoU) at inference speeds of 51.0 fps and 39.3 fps, respectively.
arXiv Detail & Related papers (2020-03-11T08:45:53Z)