DINO-Mix: Enhancing Visual Place Recognition with Foundational Vision
Model and Feature Mixing
- URL: http://arxiv.org/abs/2311.00230v2
- Date: Tue, 5 Dec 2023 09:13:53 GMT
- Authors: Gaoshuang Huang, Yang Zhou, Xiaofei Hu, Chenglong Zhang, Luying Zhao,
Wenjian Gan and Mingbo Hou
- Abstract summary: We propose a novel VPR architecture called DINO-Mix, which combines a foundational vision model with feature aggregation.
We experimentally demonstrate that the proposed DINO-Mix architecture significantly outperforms current state-of-the-art (SOTA) methods.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Utilizing visual place recognition (VPR) technology to ascertain the
geographical location of publicly available images is a pressing issue for
real-world VPR applications. Although most current VPR methods achieve
favorable results under ideal conditions, their performance in complex
environments, characterized by lighting variations, seasonal changes, and
occlusions caused by moving objects, is generally unsatisfactory. In this
study, we trim and fine-tune the DINOv2 model as the backbone network to
extract robust image features. We propose a novel VPR
architecture called DINO-Mix, which combines a foundational vision model with
feature aggregation. This architecture relies on the powerful image feature
extraction capabilities of foundational vision models. We employ an
MLP-Mixer-based mix module to aggregate image features, resulting in globally
robust and generalizable descriptors that enable high-precision VPR. We
experimentally demonstrate that the proposed DINO-Mix architecture
significantly outperforms current state-of-the-art (SOTA) methods. On test sets
featuring lighting variations, seasonal changes, and occlusions (Tokyo24/7,
Nordland, SF-XL-Testv1), our proposed DINO-Mix architecture achieved Top-1
accuracy rates of 91.75%, 80.18%, and 82%, respectively. Compared with SOTA
methods, our architecture exhibited an average accuracy improvement of 5.14%.
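To make the architecture concrete, below is a minimal sketch (not the authors' released code) of a DINO-Mix-style pipeline: a ViT backbone such as DINOv2 supplies patch tokens, and an MLP-Mixer-style module aggregates them into a single L2-normalized global descriptor. The mixer depth, projection sizes, and the torch.hub entry point mentioned in the comments are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenMixer(nn.Module):
    """One MLP-Mixer-style block mixing information across patch tokens."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(num_tokens, num_tokens),
            nn.ReLU(),
            nn.Linear(num_tokens, num_tokens),
        )

    def forward(self, x):                        # x: (B, N, D)
        y = self.norm(x).transpose(1, 2)         # (B, D, N): mix along tokens
        return x + self.mlp(y).transpose(1, 2)   # residual connection

class DinoMixSketch(nn.Module):
    """Patch tokens -> mixed tokens -> flattened, L2-normalized descriptor."""
    def __init__(self, num_tokens=256, dim=768, depth=4,
                 out_rows=4, out_channels=1024):
        super().__init__()
        self.mixers = nn.Sequential(
            *[TokenMixer(num_tokens, dim) for _ in range(depth)])
        self.channel_proj = nn.Linear(dim, out_channels)
        self.token_proj = nn.Linear(num_tokens, out_rows)

    def forward(self, tokens):                   # tokens: (B, N, D)
        x = self.mixers(tokens)
        x = self.channel_proj(x)                 # (B, N, out_channels)
        x = self.token_proj(x.transpose(1, 2))   # (B, out_channels, out_rows)
        return F.normalize(x.flatten(1), dim=-1) # (B, out_channels*out_rows)

# Patch tokens would come from a trimmed DINOv2 backbone, e.g. (assumed hub API):
#   backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
#   tokens = backbone.forward_features(imgs)['x_norm_patchtokens']
tokens = torch.randn(2, 256, 768)                # stand-in patch tokens
descriptor = DinoMixSketch()(tokens)             # torch.Size([2, 4096])
```
Retrieval then reduces to nearest-neighbor search over these descriptors.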
Related papers
- Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities (arXiv, 2024-07-29)
Contrastive Deepfake Embeddings (CoDE) is a novel embedding space specifically designed for deepfake detection.
CoDE is trained via contrastive learning by additionally enforcing global-local similarities.
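As an illustration of the training signal, here is a hedged sketch of contrastive learning with an added global-local similarity term; the InfoNCE form, the mean-pooling of local patches, and the `w_local` weighting are assumptions, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """InfoNCE between two batches of L2-normalized embeddings."""
    logits = a @ b.t() / tau
    return F.cross_entropy(logits, torch.arange(a.size(0)))

def global_local_contrastive(glob_a, glob_b, loc_a, w_local=0.5):
    # glob_*: (B, D) global embeddings of two views; loc_a: (B, P, D) patches
    glob_a = F.normalize(glob_a, dim=-1)
    glob_b = F.normalize(glob_b, dim=-1)
    loss = info_nce(glob_a, glob_b)                # global alignment
    loc = F.normalize(loc_a.mean(dim=1), dim=-1)   # pooled local patches
    return loss + w_local * info_nce(loc, glob_b)  # local-to-global agreement

loss = global_local_contrastive(torch.randn(8, 256), torch.randn(8, 256),
                                torch.randn(8, 49, 256))
```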
- EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition (arXiv, 2024-05-28)
We propose a simple yet powerful approach to better exploit the potential of a foundation model for Visual Place Recognition.
We first demonstrate that features extracted from self-attention layers can serve as a powerful re-ranker for VPR.
We then demonstrate that a single-stage method leveraging internal ViT layers for pooling can generate global features that achieve state-of-the-art results.
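A small sketch of the re-ranking idea: local features taken from a ViT self-attention layer score each shortlisted candidate. The mutual nearest-neighbor matching rule and feature shapes below are assumptions, not the paper's exact procedure.
```python
import torch
import torch.nn.functional as F

def mutual_nn_score(q_feats, c_feats):
    """q_feats: (Nq, D), c_feats: (Nc, D) local features from a ViT layer."""
    q = F.normalize(q_feats, dim=-1)
    c = F.normalize(c_feats, dim=-1)
    sim = q @ c.t()                        # (Nq, Nc) cosine similarities
    fwd = sim.argmax(dim=1)                # best candidate for each query patch
    bwd = sim.argmax(dim=0)                # best query patch for each candidate
    mutual = bwd[fwd] == torch.arange(q.size(0))   # mutual nearest neighbors
    return sim.max(dim=1).values[mutual].sum().item()

# Re-rank a retrieved shortlist by local-feature agreement with the query:
query = torch.randn(196, 768)
shortlist = [torch.randn(196, 768) for _ in range(5)]
order = sorted(range(5), key=lambda i: mutual_nn_score(query, shortlist[i]),
               reverse=True)
```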
- Learning Neural Volumetric Pose Features for Camera Localization (arXiv, 2024-03-19)
We introduce a novel neural volumetric pose feature, termed PoseMap, to enhance camera localization.
Our framework leverages an Absolute Pose Regression (APR) architecture, together with an augmented NeRF module.
We demonstrate that our method achieves 14.28% and 20.51% performance gains on average in indoor and outdoor benchmark scenes, respectively.
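For orientation, a minimal APR head of the kind such frameworks extend is sketched below; the PoseMap volumetric pose feature is stood in for by an extra feature vector concatenated before regression, which is purely an assumption for illustration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APRHead(nn.Module):
    """Regresses camera translation and rotation from image features."""
    def __init__(self, img_dim=2048, posemap_dim=256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(img_dim + posemap_dim, 512),
                                nn.ReLU())
        self.trans = nn.Linear(512, 3)     # camera translation (x, y, z)
        self.rot = nn.Linear(512, 4)       # rotation as a unit quaternion

    def forward(self, img_feat, pose_feat):
        h = self.fc(torch.cat([img_feat, pose_feat], dim=-1))
        return self.trans(h), F.normalize(self.rot(h), dim=-1)

t, q = APRHead()(torch.randn(1, 2048), torch.randn(1, 256))
```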
- DGNet: Dynamic Gradient-Guided Network for Water-Related Optics Image Enhancement (arXiv, 2023-12-12)
Underwater image enhancement (UIE) is a challenging task due to the complex degradation caused by underwater environments.
Previous methods often idealize the degradation process, and neglect the impact of medium noise and object motion on the distribution of image features.
Our approach utilizes predicted images to dynamically update pseudo-labels, adding a dynamic gradient to optimize the network's gradient space.
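The dynamic pseudo-label mechanism can be sketched as a moving target that absorbs the network's own predictions; the EMA update below is one assumed concrete realization, not necessarily the paper's rule.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_pseudo_labels(pseudo, prediction, momentum=0.9):
    """Blend the current network output into the stored pseudo-label."""
    return momentum * pseudo + (1.0 - momentum) * prediction

pseudo = torch.rand(1, 3, 64, 64)          # current pseudo-label image
pred = torch.rand(1, 3, 64, 64)            # network prediction this step
pseudo = update_pseudo_labels(pseudo, pred)
loss = F.l1_loss(pred, pseudo)             # train against the moving target
```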
- ExposureDiffusion: Learning to Expose for Low-light Image Enhancement (arXiv, 2023-07-15)
This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model.
Our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models.
The proposed framework can work with real-paired datasets, SOTA noise models, and different backbone networks.
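One way to picture the integration is an iterative refinement that operates through a physics-style exposure model, scaling irradiance per step instead of denoising raw pixels; the multiplicative exposure step below is an assumption for illustration.
```python
import torch
import torch.nn as nn

class ExposureStep(nn.Module):
    """Predicts a per-pixel exposure ratio applied multiplicatively."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1), nn.Softplus())

    def forward(self, x):
        return x * self.net(x)             # brighten by the predicted ratio

x = torch.rand(1, 3, 64, 64) * 0.1         # underexposed input
step = ExposureStep()
for _ in range(4):                          # a few refinement iterations
    x = step(x).clamp(0, 1)
```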
- GAN-based Image Compression with Improved RDO Process (arXiv, 2023-06-18)
We present a novel GAN-based image compression approach with improved rate-distortion optimization process.
To achieve this, we utilize the DISTS and MS-SSIM metrics to measure perceptual degeneration in color, texture, and structure.
The proposed method outperforms the existing GAN-based methods and the state-of-the-art hybrid codec (i.e., VVC).
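The improved RDO objective can be sketched as rate plus a perceptual distortion mixing DISTS and MS-SSIM; metric implementations are injected as callables (packages such as piq provide them), and the weights below are placeholders, not the paper's values.
```python
import torch

def rdo_loss(rate_bpp, x, x_hat, dists_fn, msssim_fn,
             lam=1.0, alpha=1.0, beta=0.5):
    """Rate-distortion objective with a perceptual distortion term."""
    distortion = (alpha * dists_fn(x_hat, x)
                  + beta * (1.0 - msssim_fn(x_hat, x)))  # MS-SSIM is a similarity
    return distortion + lam * rate_bpp      # trade off bits vs. perception

x, x_hat = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
dummy = lambda a, b: (a - b).abs().mean()   # stand-in for DISTS / MS-SSIM
loss = rdo_loss(torch.tensor(0.3), x, x_hat, dummy, dummy)
```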
- Image-specific Convolutional Kernel Modulation for Single Image Super-resolution (arXiv, 2021-11-16)
To address this issue, we propose a novel image-specific convolutional kernel modulation (IKM).
We exploit the global contextual information of image or feature to generate an attention weight for adaptively modulating the convolutional kernels.
Experiments on single image super-resolution show that the proposed methods achieve superior performance over state-of-the-art methods.
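A hedged sketch of the modulation mechanism: global context is pooled into per-channel attention weights that rescale a shared convolution kernel for each image (the pooling/MLP design is an assumed minimal form).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IKMConv(nn.Module):
    """Convolution whose kernel is modulated per image by global context."""
    def __init__(self, ch=64, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(ch, ch, k, k) * 0.02)
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(ch, ch), nn.Sigmoid())

    def forward(self, x):                   # x: (B, C, H, W)
        outs = []
        for i in range(x.size(0)):          # per-image kernel modulation
            a = self.attn(x[i:i + 1]).view(-1, 1, 1, 1)  # (C,1,1,1) weights
            w = self.weight * a             # rescale output channels
            outs.append(F.conv2d(x[i:i + 1], w, padding=1))
        return torch.cat(outs, 0)

y = IKMConv()(torch.randn(2, 64, 32, 32))
```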
- Domain-invariant Similarity Activation Map Contrastive Learning for Retrieval-based Long-term Visual Localization (arXiv, 2020-09-16)
In this work, a general architecture is first formulated probabilistically to extract domain-invariant features through multi-domain image translation.
A novel gradient-weighted similarity activation mapping loss (Grad-SAM) is then incorporated for finer, more accurate localization.
Extensive experiments have been conducted to validate the effectiveness of the proposed approach on the CMUSeasons dataset.
Our method performs on par with, or even outperforms, the state-of-the-art image-based localization baselines at medium and high precision.
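The map at the heart of the loss can be sketched in the Grad-CAM style: the similarity between two image descriptors is backpropagated to the feature map, and gradient-derived channel weights highlight the regions driving the match. How the map enters the training loss is not reproduced here.
```python
import torch
import torch.nn.functional as F

def similarity_activation_map(feat_q, feat_r):
    """feat_*: (1, C, H, W) conv features of query / reference images."""
    feat_q = feat_q.requires_grad_(True)
    q = F.normalize(feat_q.mean(dim=(2, 3)), dim=-1)   # pooled descriptors
    r = F.normalize(feat_r.mean(dim=(2, 3)), dim=-1)
    sim = (q * r).sum()                                # cosine similarity
    grads, = torch.autograd.grad(sim, feat_q)
    weights = grads.mean(dim=(2, 3), keepdim=True)     # channel importance
    return F.relu((weights * feat_q).sum(dim=1))       # (1, H, W) map

m = similarity_activation_map(torch.randn(1, 256, 14, 14),
                              torch.randn(1, 256, 14, 14))
```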
- Two-shot Spatially-varying BRDF and Shape Estimation (arXiv, 2020-04-01)
We propose a novel deep learning architecture with a stage-wise estimation of shape and SVBRDF.
We create a large-scale synthetic training dataset with domain-randomized geometry and realistic materials.
Experiments on both synthetic and real-world datasets show that our network trained on a synthetic dataset can generalize well to real-world images.
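The stage-wise idea reads naturally as two chained networks: the first regresses shape (normals) from the two shots, the second predicts SVBRDF maps conditioned on the input plus the estimated shape; the channel layouts below are assumptions for illustration.
```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())

# Stage 1: two captured shots (6 channels) -> surface normals (3 channels).
shape_net = nn.Sequential(conv_block(6, 32), nn.Conv2d(32, 3, 1))
# Stage 2: shots + normals -> albedo(3) + specular(3) + roughness(1).
brdf_net = nn.Sequential(conv_block(9, 32), nn.Conv2d(32, 7, 1))

two_shots = torch.rand(1, 6, 128, 128)      # e.g. flash / no-flash pair
normals = torch.tanh(shape_net(two_shots))
svbrdf = brdf_net(torch.cat([two_shots, normals], dim=1))
```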
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.