Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for
Place Recognition
- URL: http://arxiv.org/abs/2103.01486v1
- Date: Tue, 2 Mar 2021 05:53:32 GMT
- Title: Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for
Place Recognition
- Authors: Stephen Hausler, Sourav Garg, Ming Xu, Michael Milford, Tobias Fischer
- Abstract summary: This paper introduces Patch-NetVLAD, which provides a novel formulation for combining the advantages of both local and global descriptor methods.
We show that Patch-NetVLAD outperforms both global and local feature descriptor-based methods with comparable compute.
It is also adaptable to user requirements, with a speed-optimised version operating over an order of magnitude faster than the state-of-the-art.
- Score: 29.282413482297255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Place Recognition is a challenging task for robotics and autonomous
systems, which must deal with the twin problems of appearance and viewpoint
change in an always changing world. This paper introduces Patch-NetVLAD, which
provides a novel formulation for combining the advantages of both local and
global descriptor methods by deriving patch-level features from NetVLAD
residuals. Unlike the fixed spatial neighborhood regime of existing local
keypoint features, our method enables aggregation and matching of deep-learned
local features defined over the feature-space grid. We further introduce a
multi-scale fusion of patch features that have complementary scales (i.e. patch
sizes) via an integral feature space and show that the fused features are
highly invariant to both condition (season, structure, and illumination) and
viewpoint (translation and rotation) changes. Patch-NetVLAD outperforms both
global and local feature descriptor-based methods with comparable compute,
achieving state-of-the-art visual place recognition results on a range of
challenging real-world datasets, including winning the Facebook Mapillary
Visual Place Recognition Challenge at ECCV2020. It is also adaptable to user
requirements, with a speed-optimised version operating over an order of
magnitude faster than the state-of-the-art. By combining superior performance
with improved computational efficiency in a configurable framework,
Patch-NetVLAD is well suited to enhance both stand-alone place recognition
capabilities and the overall performance of SLAM systems.
Related papers
- PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN)
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models [96.76995840807615]
HiRes-LLaVA is a novel framework designed to process any size of high-resolution input without altering the original contextual and geometric information.
HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler to compress the vision tokens based on themselves.
arXiv Detail & Related papers (2024-07-11T17:42:17Z) - Unifying Local and Global Multimodal Features for Place Recognition in Aliased and Low-Texture Environments [19.859565090638167]
This paper presents a novel model, called UMF, that leverages multi-modality by cross-attention blocks between vision and LiDAR features.
Experiments, particularly on sequences captured on a planetary-analogous environment, show that UMF outperforms significantly previous baselines.
Our work aims to enhance the reliability of SLAM in all situations, we also explore its performance on the widely used RobotCar dataset.
arXiv Detail & Related papers (2024-03-20T08:35:57Z) - Slide-Transformer: Hierarchical Vision Transformer with Local
Self-Attention [34.26177289099421]
Self-attention mechanism has been a key factor in the recent progress of Vision Transformer (ViT)
We propose a novel local attention module, which leverages common convolution operations to achieve high efficiency, flexibility and generalizability.
Our module realizes the local attention paradigm in both efficient and flexible manner.
arXiv Detail & Related papers (2023-04-09T13:37:59Z) - Spatial-Aware Token for Weakly Supervised Object Localization [137.0570026552845]
We propose a task-specific spatial-aware token to condition localization in a weakly supervised manner.
Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc.
arXiv Detail & Related papers (2023-03-18T15:38:17Z) - Adaptive Local-Component-aware Graph Convolutional Network for One-shot
Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art.
arXiv Detail & Related papers (2022-09-21T02:33:07Z) - SphereVLAD++: Attention-based and Signal-enhanced Viewpoint Invariant
Descriptor [6.326554177747699]
We develop SphereVLAD++, an attention-enhanced viewpoint invariant place recognition method.
We show that SphereVLAD++ outperforms all relative state-of-the-art 3D place recognition methods under small or even totally reversed viewpoint differences.
arXiv Detail & Related papers (2022-07-06T20:32:43Z) - Cross-modal Local Shortest Path and Global Enhancement for
Visible-Thermal Person Re-Identification [2.294635424666456]
We propose the Cross-modal Local Shortest Path and Global Enhancement (CM-LSP-GE) modules,a two-stream network based on joint learning of local and global features.
The experimental results on two typical datasets show that our model is obviously superior to the most state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T10:27:22Z) - Conformer: Local Features Coupling Global Representations for Visual
Recognition [72.9550481476101]
We propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning.
Experiments show that Conformer, under the comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet.
arXiv Detail & Related papers (2021-05-09T10:00:03Z) - Global Context-Aware Progressive Aggregation Network for Salient Object
Detection [117.943116761278]
We propose a novel network named GCPANet to integrate low-level appearance features, high-level semantic features, and global context features.
We show that the proposed approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2020-03-02T04:26:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.