Learning Sequential Descriptors for Sequence-based Visual Place
Recognition
- URL: http://arxiv.org/abs/2207.03868v1
- Date: Fri, 8 Jul 2022 12:52:04 GMT
- Title: Learning Sequential Descriptors for Sequence-based Visual Place
Recognition
- Authors: Riccardo Mereu, Gabriele Trivigno, Gabriele Berton, Carlo Masone,
Barbara Caputo
- Abstract summary: In robotics, Visual Place Recognition is a continuous process that receives as input a video stream to produce a hypothesis of the robot's current position.
This work proposes a detailed taxonomy of techniques using sequential descriptors, highlighting the different mechanisms used to fuse the information from the individual images.
- Score: 14.738954189759156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In robotics, Visual Place Recognition is a continuous process that receives
as input a video stream to produce a hypothesis of the robot's current position
within a map of known places. This task requires robust, scalable, and
efficient techniques for real applications. This work proposes a detailed
taxonomy of techniques using sequential descriptors, highlighting the different
mechanisms used to fuse the information from the individual images. This
categorization is supported by a complete benchmark of experimental results
that provides evidence on the strengths and weaknesses of these different
architectural choices. In comparison to existing sequential descriptor
methods, we further investigate the viability of Transformers instead of CNN
backbones, and we propose a new ad-hoc sequence-level aggregator called
SeqVLAD, which outperforms prior state of the art on different datasets. The
code is available at https://github.com/vandal-vpr/vg-transformers.
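To make the sequence-level aggregation idea concrete, below is a minimal PyTorch sketch of a NetVLAD-style aggregator applied to the local features of every frame in a sequence, which is the spirit of SeqVLAD as described in the abstract. The feature dimension, number of clusters, and normalization choices are placeholder assumptions, not the authors' implementation; refer to the linked repository for the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqVLADSketch(nn.Module):
    """NetVLAD-style aggregation over the local features of a whole sequence."""

    def __init__(self, dim=256, num_clusters=64):
        super().__init__()
        self.num_clusters = num_clusters
        # Soft-assignment of every local feature to a cluster (1x1 conv = per-feature linear).
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)
        # Learnable cluster centroids.
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):
        # x: (B, S, D, H, W) feature maps from a frame-level backbone (CNN or ViT).
        b, s, d, h, w = x.shape
        x = x.reshape(b * s, d, h, w)                               # one bag of local features
        assign = F.softmax(self.assign(x), dim=1)                   # (B*S, K, H, W)
        assign = assign.reshape(b, s, self.num_clusters, h * w)
        assign = assign.permute(0, 2, 1, 3).reshape(b, self.num_clusters, s * h * w)
        feats = x.reshape(b, s, d, h * w).permute(0, 2, 1, 3).reshape(b, d, s * h * w)
        # VLAD residuals: sum_n a_k(n) * (f_n - c_k), pooled over all frames at once.
        vlad = torch.einsum('bkn,bdn->bkd', assign, feats) \
             - assign.sum(dim=-1, keepdim=True) * self.centroids.unsqueeze(0)
        vlad = F.normalize(vlad, p=2, dim=2)                        # intra-cluster normalization
        return F.normalize(vlad.flatten(1), p=2, dim=1)             # (B, K*D) sequence descriptor
```

Any frame-level backbone producing (B, S, D, H, W) maps can feed this module; the resulting sequence descriptor has length num_clusters x dim.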
Related papers
- EventTransAct: A video transformer-based framework for Event-camera
based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
To better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
- Dynamic Perceiver for Efficient Visual Recognition [87.08210214417309]
We propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure and the early classification task.
A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks.
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
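As an illustration of the decoupling summarized above, here is a rough, assumption-laden sketch of a two-branch model in which a convolutional feature branch and a latent-code classification branch run side by side, with every early exit attached to the classification branch. The layer shapes, the cross-attention coupling, and the confidence-based exit rule are illustrative choices, not the actual Dyn-Perceiver design.

```python
import torch
import torch.nn as nn

class TwoBranchEarlyExit(nn.Module):
    def __init__(self, dim=128, num_classes=1000, num_stages=3, num_latents=16):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)
        # Feature branch: plain convolutional stages extracting image features.
        self.feature_stages = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1) for _ in range(num_stages)
        )
        # Classification branch: a latent code refined with cross-attention over the features.
        self.latent = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=4, batch_first=True) for _ in range(num_stages)
        )
        # Every early exit lives on the classification branch only.
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(num_stages))

    def forward(self, x, exit_threshold=0.8):
        feats = self.stem(x)
        latent = self.latent.unsqueeze(0).expand(x.size(0), -1, -1)
        for stage, attn, head in zip(self.feature_stages, self.cross_attn, self.exits):
            feats = stage(feats)                               # feature branch step
            tokens = feats.flatten(2).transpose(1, 2)          # (B, H*W, D)
            latent, _ = attn(latent, tokens, tokens)           # classification branch step
            logits = head(latent.mean(dim=1))
            confidence = logits.softmax(dim=-1).max(dim=-1).values
            if not self.training and bool((confidence > exit_threshold).all()):
                return logits                                  # exit early on easy inputs
        return logits                                          # training would supervise all exits
```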
arXiv Detail & Related papers (2023-06-20T03:00:22Z)
- SegTransVAE: Hybrid CNN -- Transformer with Regularization for medical
image segmentation [0.0]
A novel network named SegTransVAE is proposed in this paper.
SegTransVAE is built upon an encoder-decoder architecture, combining a transformer with a variational autoencoder (VAE) branch.
Evaluation on various recently introduced datasets shows that SegTransVAE outperforms previous methods in Dice Score and $95\%$ Hausdorff Distance.
arXiv Detail & Related papers (2022-01-21T08:02:55Z)
- Efficient Video Transformers with Spatial-Temporal Token Selection [68.27784654734396]
We present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples.
Our framework achieves results comparable to the full model while requiring 20% less computation.
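A minimal sketch of the general idea of scoring video tokens and keeping only the most informative ones before the expensive transformer stages. The scorer, keep ratio, and plain top-k selection below are assumptions for illustration; the actual STTS method relies on a differentiable selection so the scorer can be trained end to end.

```python
import torch
import torch.nn as nn

class TopKTokenSelector(nn.Module):
    def __init__(self, dim=384, keep=0.5):
        super().__init__()
        self.keep = keep
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens):
        # tokens: (B, N, D) video tokens (temporal x spatial positions flattened).
        b, n, d = tokens.shape
        k = max(1, int(n * self.keep))
        scores = self.scorer(tokens).squeeze(-1)           # (B, N) informativeness scores
        top_idx = scores.topk(k, dim=1).indices            # keep the k best tokens
        top_idx = top_idx.unsqueeze(-1).expand(-1, -1, d)  # (B, k, D)
        return torch.gather(tokens, dim=1, index=top_idx)  # reduced token set

# e.g. halving the token count before the costly temporal transformer:
# selector = TopKTokenSelector(dim=384, keep=0.5)
# reduced = selector(torch.randn(2, 1568, 384))            # -> (2, 784, 384)
```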
arXiv Detail & Related papers (2021-11-23T00:35:58Z)
- MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake
Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos.
We present a novel approach, termed as MD-CSDNetwork, for combining the features in the spatial and frequency domains to mine a shared discriminative representation.
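As a rough illustration of fusing spatial- and frequency-domain features, the sketch below pairs a spatial stream with an FFT-magnitude stream and mixes them with a learned cross-stitch-style 2x2 matrix. The streams, sizes, and single fusion point are placeholder assumptions and do not reproduce the actual MD-CSDNetwork.

```python
import torch
import torch.nn as nn

class SpatialFrequencyFusion(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.spatial_net = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.freq_net = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        # 2x2 cross-stitch matrix: how much each output stream takes from each input stream.
        self.stitch = nn.Parameter(torch.tensor([[0.9, 0.1], [0.1, 0.9]]))

    def forward(self, x):
        spatial = self.spatial_net(x)
        # Frequency-domain view of the image: log-magnitude of its 2D FFT.
        freq = self.freq_net(torch.log1p(torch.abs(torch.fft.fft2(x))))
        fused_spatial = self.stitch[0, 0] * spatial + self.stitch[0, 1] * freq
        fused_freq = self.stitch[1, 0] * spatial + self.stitch[1, 1] * freq
        return fused_spatial, fused_freq
```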
arXiv Detail & Related papers (2021-09-15T14:11:53Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID, achieving high performance.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- Self-Supervised Learning via multi-Transformation Classification for
Action Recognition [10.676377556393527]
We introduce a self-supervised video representation learning method based on the multi-transformation classification to efficiently classify human actions.
The representation of the video is learned in a self-supervised manner by classifying seven different transformations.
We have conducted the experiments on UCF101 and HMDB51 datasets together with C3D and 3D Resnet-18 as backbone networks.
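A minimal sketch of this kind of pretext task: apply one of several transformations to a clip and train the network to predict which one was applied. The concrete transformations listed below are illustrative stand-ins, not necessarily the seven used in the paper.

```python
import random
import torch
import torch.nn as nn

# Candidate transformations over a clip of shape (C, T, H, W); square spatial
# crops are assumed so that rotations preserve the tensor shape.
TRANSFORMS = [
    ("identity",     lambda c: c),
    ("h_flip",       lambda c: torch.flip(c, dims=[3])),
    ("v_flip",       lambda c: torch.flip(c, dims=[2])),
    ("time_reverse", lambda c: torch.flip(c, dims=[1])),
    ("rotate_90",    lambda c: torch.rot90(c, k=1, dims=(2, 3))),
    ("rotate_180",   lambda c: torch.rot90(c, k=2, dims=(2, 3))),
    ("rotate_270",   lambda c: torch.rot90(c, k=3, dims=(2, 3))),
]

def pretext_batch(clips):
    """Apply a random transformation to each clip; the label is its index."""
    labels = [random.randrange(len(TRANSFORMS)) for _ in clips]
    transformed = [TRANSFORMS[lbl][1](clip) for clip, lbl in zip(clips, labels)]
    return torch.stack(transformed), torch.tensor(labels)

# Training step with any 3D backbone (e.g. a C3D- or 3D-ResNet-style encoder;
# backbone and head are hypothetical names) and a linear classifier over
# len(TRANSFORMS) classes:
#   x, y = pretext_batch(clips)
#   loss = nn.CrossEntropyLoss()(head(backbone(x)), y)
```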
arXiv Detail & Related papers (2021-02-20T16:11:26Z)
- A cellular automata approach to local patterns for texture recognition [3.42658286826597]
We propose a method for texture descriptors that combines the representation power of complex objects by cellular automata with the known effectiveness of local descriptors in texture analysis.
Our proposal outperforms other classical and state-of-the-art approaches, especially on the real-world problem considered in the paper.
arXiv Detail & Related papers (2020-07-15T03:25:51Z)
- Unsupervised Learning of Video Representations via Dense Trajectory
Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)