DeepSeqSLAM: A Trainable CNN+RNN for Joint Global Description and
Sequence-based Place Recognition
- URL: http://arxiv.org/abs/2011.08518v1
- Date: Tue, 17 Nov 2020 09:14:02 GMT
- Title: DeepSeqSLAM: A Trainable CNN+RNN for Joint Global Description and
Sequence-based Place Recognition
- Authors: Marvin Chancán, Michael Milford
- Abstract summary: We propose DeepSeqSLAM: a trainable CNN+RNN architecture for jointly learning visual and positional representations from a single monocular image sequence of a route.
We demonstrate our approach on two large benchmark datasets, Nordland and Oxford RobotCar.
Our approach achieves over 72% AUC, compared to 27% AUC for Delta Descriptors and 2% AUC for SeqSLAM, while drastically reducing deployment time from around 1 hour to 1 minute relative to both.
- Score: 23.54696982881734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sequence-based place recognition methods for all-weather navigation are
well-known for producing state-of-the-art results under challenging day-night
or summer-winter transitions. These systems, however, rely on complex
handcrafted heuristics for sequential matching - which are applied on top of a
pre-computed pairwise similarity matrix between reference and query image
sequences of a single route - to further reduce false-positive rates compared
to single-frame retrieval methods. As a result, performing multi-frame place
recognition can be extremely slow for deployment on autonomous vehicles or
evaluation on large datasets, and can fail when using relatively short
sequence lengths, such as 2 frames. In this paper, we propose
DeepSeqSLAM: a trainable CNN+RNN architecture for jointly learning visual and
positional representations from a single monocular image sequence of a route.
We demonstrate our approach on two large benchmark datasets, Nordland and
Oxford RobotCar, recorded over 728 km and 10 km routes, respectively, each
over the course of a year with multiple seasons, weather, and lighting conditions. On
Nordland, we compare our method to two state-of-the-art sequence-based methods
across the entire route under summer-winter changes using a sequence length of
2 and show that our approach achieves over 72% AUC, compared to 27% AUC for Delta
Descriptors and 2% AUC for SeqSLAM, while drastically reducing the deployment
time from around 1 hour to 1 minute relative to both. The framework code and video
are available at https://mchancan.github.io/deepseqslam
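
The joint visual-positional design is simple to prototype. Below is a minimal PyTorch sketch in the spirit of the paper: a CNN produces a global descriptor per frame, which is concatenated with positional data and passed through an LSTM. The ResNet-18 backbone, hidden size, and fusion scheme are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class SeqPlaceNet(nn.Module):
    """Sketch of a CNN+RNN place-recognition model (illustrative, not
    the authors' exact DeepSeqSLAM configuration)."""
    def __init__(self, num_places, pos_dim=2, hidden_size=512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Drop the classifier; keep the 512-d global average-pooled descriptor.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        # The RNN consumes the image descriptor concatenated with position.
        self.rnn = nn.LSTM(512 + pos_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_places)

    def forward(self, images, positions):
        # images: (B, T, 3, H, W); positions: (B, T, pos_dim), e.g. normalized odometry
        b, t = images.shape[:2]
        feats = self.cnn(images.flatten(0, 1)).flatten(1).view(b, t, -1)
        fused = torch.cat([feats, positions], dim=-1)
        out, _ = self.rnn(fused)
        return self.head(out[:, -1])  # classify the place at the last frame

# A sequence length of 2, as in the Nordland experiments above.
model = SeqPlaceNet(num_places=1000)
logits = model(torch.randn(4, 2, 3, 224, 224), torch.randn(4, 2, 2))
```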
Related papers
- RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation [46.659592045271125]
RTMO is a one-stage pose estimation framework that seamlessly integrates coordinate classification.
It achieves accuracy comparable to top-down methods while maintaining high speed.
Our largest model, RTMO-l, attains 74.8% AP on COCO val 2017 and 141 FPS on a single V100 GPU.
arXiv Detail & Related papers (2023-12-12T18:55:29Z)
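
For intuition on the coordinate-classification idea above, here is a toy head that predicts each keypoint coordinate as a distribution over discrete bins and decodes it by expectation; shapes and bin counts are hypothetical, not RTMO's actual design.

```python
import torch
import torch.nn as nn

class CoordClassifier(nn.Module):
    """Toy coordinate-classification head: each keypoint x/y is predicted
    as a distribution over discrete bins and decoded by expectation."""
    def __init__(self, feat_dim, num_bins_x, num_bins_y):
        super().__init__()
        self.to_x = nn.Linear(feat_dim, num_bins_x)
        self.to_y = nn.Linear(feat_dim, num_bins_y)

    def forward(self, feats):
        # feats: (B, K, feat_dim) -- one feature vector per keypoint
        px = self.to_x(feats).softmax(dim=-1)           # (B, K, num_bins_x)
        py = self.to_y(feats).softmax(dim=-1)
        xs = (px * torch.arange(px.shape[-1])).sum(-1)  # expected bin index
        ys = (py * torch.arange(py.shape[-1])).sum(-1)
        return torch.stack([xs, ys], dim=-1)            # (B, K, 2) coordinates

coords = CoordClassifier(256, 192, 256)(torch.randn(2, 17, 256))
```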
- Unified Coarse-to-Fine Alignment for Video-Text Retrieval [71.85966033484597]
We propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA.
Our model captures the cross-modal similarity information at different granularity levels.
We apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them.
arXiv Detail & Related papers (2023-09-18T19:04:37Z)
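
For reference, the Sinkhorn-Knopp normalization mentioned above alternates row and column normalization until the matrix is approximately doubly stochastic. A minimal NumPy sketch for a square similarity matrix; the iteration count and epsilon are illustrative, not UCoFiA's settings.

```python
import numpy as np

def sinkhorn_knopp(sim, n_iters=50, eps=1e-8):
    """Alternately normalize rows and columns of a non-negative square
    similarity matrix until it is approximately doubly stochastic."""
    m = np.asarray(sim, dtype=float) + eps  # keep entries strictly positive
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)   # row normalization
        m /= m.sum(axis=0, keepdims=True)   # column normalization
    return m

balanced = sinkhorn_knopp(np.random.rand(5, 5))
```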
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
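
To make the two-stage pattern above concrete, here is a heavily simplified match-then-refine tracker: a global argmax match per frame, then a local soft-argmax refinement. This illustrates only the pattern, not TAPIR's architecture.

```python
import torch
import torch.nn.functional as F

def match_then_refine(query_feat, frame_feats, radius=3):
    """Toy two-stage point tracker on feature maps."""
    # query_feat: (C,); frame_feats: (T, C, H, W)
    t, c, h, w = frame_feats.shape
    corr = torch.einsum('c,tchw->thw', query_feat, frame_feats)  # similarity maps
    flat_idx = corr.view(t, -1).argmax(dim=1)                    # stage 1: coarse match
    ys, xs = flat_idx // w, flat_idx % w
    tracks = []
    for i in range(t):
        yi, xi = int(ys[i]), int(xs[i])
        y0, y1 = max(yi - radius, 0), min(yi + radius + 1, h)
        x0, x1 = max(xi - radius, 0), min(xi + radius + 1, w)
        local = corr[i, y0:y1, x0:x1]
        p = F.softmax(local.flatten(), dim=0).view_as(local)     # stage 2: soft-argmax
        gy, gx = torch.meshgrid(torch.arange(y0, y1, dtype=torch.float32),
                                torch.arange(x0, x1, dtype=torch.float32),
                                indexing='ij')
        tracks.append(((p * gy).sum().item(), (p * gx).sum().item()))
    return tracks  # one refined (y, x) estimate per frame

pts = match_then_refine(torch.randn(64), torch.randn(10, 64, 32, 32))
```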
- SFNet: Faster and Accurate Semantic Segmentation via Semantic Flow [88.97790684009979]
A common practice to improve the performance is to attain high-resolution feature maps with strong semantic representation.
We propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels.
We also present a novel Gated Dual Flow Alignment Module to directly align high-resolution feature maps and low-resolution feature maps.
arXiv Detail & Related papers (2022-07-10T08:25:47Z)
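
A rough sketch of the flow-alignment idea above: predict a 2-channel flow field from the concatenated features and warp the upsampled low-resolution map with it. Layer shapes and warping details are assumptions; SFNet's actual module differs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowAlign(nn.Module):
    """Warp an upsampled low-resolution feature map toward the
    high-resolution one using a learned 2-channel flow field."""
    def __init__(self, channels):
        super().__init__()
        self.flow_conv = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

    def forward(self, high, low):
        b, c, h, w = high.shape
        low_up = F.interpolate(low, size=(h, w), mode='bilinear', align_corners=False)
        flow = self.flow_conv(torch.cat([high, low_up], dim=1))  # (B, 2, H, W)
        # Base sampling grid in pixel coordinates, offset by the flow.
        ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                                torch.arange(w, dtype=torch.float32), indexing='ij')
        grid = torch.stack([xs, ys], dim=-1) + flow.permute(0, 2, 3, 1)  # (B, H, W, 2)
        # Normalize to [-1, 1] as grid_sample expects.
        norm = torch.tensor([max(w - 1, 1), max(h - 1, 1)], dtype=torch.float32)
        grid = 2.0 * grid / norm - 1.0
        return F.grid_sample(low_up, grid, align_corners=True)

aligned = FlowAlign(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 16, 16))
```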
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
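
As a toy illustration of the channel-splitting step above alone (SDTA additionally applies transposed attention across channels, omitted here), each channel group gets its own depth-wise convolution:

```python
import torch
import torch.nn as nn

class ChannelGroupMixer(nn.Module):
    """Toy channel-splitting block: split the input into channel groups,
    run a depth-wise conv per group, then re-concatenate."""
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        gc = channels // groups
        self.groups = groups
        self.dwconvs = nn.ModuleList(
            [nn.Conv2d(gc, gc, kernel_size=3, padding=1, groups=gc)
             for _ in range(groups)])

    def forward(self, x):
        chunks = torch.chunk(x, self.groups, dim=1)
        return torch.cat([conv(c) for conv, c in zip(self.dwconvs, chunks)], dim=1)

out = ChannelGroupMixer(64)(torch.randn(1, 64, 32, 32))
```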
- DTWSSE: Data Augmentation with a Siamese Encoder for Time Series [8.019203034348083]
We propose DTWSSE, a DTW-based synthetic minority oversampling technique that uses a siamese encoder for interpolation.
To measure the distance between time series reasonably, DTW, which has been verified to be an effective method, is employed as the distance metric.
The encoder is a Neural Network for mapping the time series data from the DTW hidden space to the Euclidean deep feature space, and the decoder is used to map the deep feature space back to the DTW hidden space.
arXiv Detail & Related papers (2021-08-23T01:46:24Z)
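
Since DTW is the distance metric at the core of the entry above, the textbook dynamic-programming form is sketched below for reference; DTWSSE's siamese encoder/decoder mapping is not shown.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

print(dtw_distance([0, 1, 2, 3], [0, 1, 1, 2, 3]))  # small: similar shapes
```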
- Sequential Place Learning: Heuristic-Free High-Performance Long-Term Place Recognition [24.70946979449572]
We develop a learning-based CNN+LSTM architecture, trainable via backpropagation through time, for viewpoint- and appearance-invariant place recognition.
Our model outperforms 15 classical methods while setting new state-of-the-art performance standards.
In addition, we show that SPL can be up to 70x faster to deploy than classical methods on a 729 km route.
arXiv Detail & Related papers (2021-03-02T22:57:43Z)
- Understanding Image Retrieval Re-Ranking: A Graph Neural Network Perspective [52.96911968968888]
In this paper, we demonstrate that re-ranking can be reformulated as a high-parallelism Graph Neural Network (GNN) function.
On the Market-1501 dataset, we accelerate the re-ranking processing from 89.2s to 9.4ms with one K40m GPU, facilitating the real-time post-processing.
arXiv Detail & Related papers (2020-12-14T15:12:36Z)
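
A toy sketch of re-ranking as one round of neighbor aggregation on a k-NN graph, which is the flavor of GNN computation the entry above exploits for parallelism; k and alpha are arbitrary choices here, not the paper's settings.

```python
import numpy as np

def knn_graph_rerank(feats, k=5, alpha=0.5):
    """Smooth each descriptor with its top-k neighbors on a k-NN graph,
    so similarities recomputed afterwards reflect neighborhood structure."""
    sims = feats @ feats.T                        # cosine sims (rows L2-normalized)
    nbrs = np.argsort(-sims, axis=1)[:, 1:k + 1]  # top-k neighbors, excluding self
    adj = np.zeros_like(sims)
    np.put_along_axis(adj, nbrs, 1.0, axis=1)
    adj /= adj.sum(axis=1, keepdims=True)         # row-normalize the graph
    return (1 - alpha) * feats + alpha * adj @ feats

feats = np.random.randn(100, 64)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
refined = knn_graph_rerank(feats)  # re-rank by recomputing sims over `refined`
```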
- Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted to temporal bilinear modules by adding an auxiliary-branch sampling.
Our models can outperform most state-of-the-art methods on the Something-Something v1 and v2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
- SUPER: A Novel Lane Detection System [26.417172945374364]
We propose a real-time lane detection system, called Scene Understanding Physics-Enhanced Real-time (SUPER) algorithm.
We train the proposed system using heterogeneous data from Cityscapes, Vistas and Apollo, and evaluate the performance on four completely separate datasets.
Preliminary test results show promising real-time lane-detection performance compared with Mobileye.
arXiv Detail & Related papers (2020-05-14T21:40:39Z)