DINO-Mix: Enhancing Visual Place Recognition with Foundational Vision
Model and Feature Mixing
- URL: http://arxiv.org/abs/2311.00230v2
- Date: Tue, 5 Dec 2023 09:13:53 GMT
- Title: DINO-Mix: Enhancing Visual Place Recognition with Foundational Vision
Model and Feature Mixing
- Authors: Gaoshuang Huang, Yang Zhou, Xiaofei Hu, Chenglong Zhang, Luying Zhao,
Wenjian Gan and Mingbo Hou
- Abstract summary: We propose a novel VPR architecture called DINO-Mix, which combines a foundational vision model with feature aggregation.
We experimentally demonstrate that the proposed DINO-Mix architecture significantly outperforms current state-of-the-art (SOTA) methods.
- Score: 4.053793612295086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Utilizing visual place recognition (VPR) technology to ascertain the
geographical location of publicly available images is a pressing issue for
real-world VPR applications. Although most current VPR methods achieve
favorable results under ideal conditions, their performance in complex
environments, characterized by lighting variations, seasonal changes, and
occlusions caused by moving objects, is generally unsatisfactory. In this
study, we trim and fine-tune the DINOv2 model as the backbone network to
extract robust image features. We propose a novel VPR
architecture called DINO-Mix, which combines a foundational vision model with
feature aggregation. This architecture relies on the powerful image feature
extraction capabilities of foundational vision models. We employ an
MLP-Mixer-based mix module to aggregate image features, resulting in globally
robust and generalizable descriptors that enable high-precision VPR. We
experimentally demonstrate that the proposed DINO-Mix architecture
significantly outperforms current state-of-the-art (SOTA) methods. On test sets
with lighting variations, seasonal changes, and occlusions (Tokyo24/7,
Nordland, and SF-XL-Testv1), our proposed DINO-Mix architecture achieved Top-1
accuracy rates of 91.75%, 80.18%, and 82%, respectively. Compared with SOTA
methods, our architecture exhibited an average accuracy improvement of 5.14%.
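The pipeline the abstract describes (patch features from a trimmed DINOv2 backbone, aggregated by an MLP-Mixer-based mix module into one global descriptor) might be sketched roughly as below. This is a hedged illustration only: the layer sizes, random weights, and the plain-NumPy stand-in for the transformer backbone are assumptions for self-containment, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with a tanh-approximated GELU, applied along the last axis."""
    h = x @ w1 + b1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2 + b2

def mix_module(patch_feats, out_dim=128):
    """MLP-Mixer-style aggregation of backbone patch features into a single
    L2-normalized global descriptor (a sketch, not the paper's code).

    patch_feats: (num_patches, channels) token embeddings from the backbone.
    """
    n, c = patch_feats.shape
    # Token-mixing MLP: mixes information across patches (operates on the
    # transposed tokens), with a residual connection.
    w1, b1 = rng.normal(0, 0.02, (n, n)), np.zeros(n)
    w2, b2 = rng.normal(0, 0.02, (n, n)), np.zeros(n)
    x = patch_feats + mlp(patch_feats.T, w1, b1, w2, b2).T
    # Channel-mixing MLP: mixes information within each patch, with a residual.
    w3, b3 = rng.normal(0, 0.02, (c, c)), np.zeros(c)
    w4, b4 = rng.normal(0, 0.02, (c, c)), np.zeros(c)
    x = x + mlp(x, w3, b3, w4, b4)
    # Project the flattened tokens to a compact descriptor and L2-normalize,
    # so retrieval can use a plain dot product as cosine similarity.
    proj = rng.normal(0, 0.02, (n * c, out_dim))
    desc = x.reshape(-1) @ proj
    return desc / np.linalg.norm(desc)

# Stand-in for DINOv2 patch tokens of one image: 16 patches x 64 channels.
tokens = rng.normal(size=(16, 64))
descriptor = mix_module(tokens)
print(descriptor.shape)  # (128,)
```

The L2 normalization at the end is what makes the descriptor directly usable for nearest-neighbor place retrieval.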
Related papers
- BRIGHT-VO: Brightness-Guided Hybrid Transformer for Visual Odometry with Multi-modality Refinement Module [11.898515581215708]
Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks.
We introduce BrightVO, a novel VO model based on Transformer architecture, which performs front-end visual feature extraction.
Its back-end uses pose graph optimization to iteratively refine pose estimates, reducing errors and improving both accuracy and robustness.
arXiv Detail & Related papers (2025-01-15T08:50:52Z)
- Predicting Satisfied User and Machine Ratio for Compressed Images: A Unified Approach [58.71009078356928]
We create a deep learning-based model to predict Satisfied User Ratio (SUR) and Satisfied Machine Ratio (SMR) of compressed images simultaneously.
Experimental results indicate that the proposed model significantly outperforms state-of-the-art SUR and SMR prediction methods.
arXiv Detail & Related papers (2024-12-23T11:09:30Z)
- Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities [88.398085358514]
Contrastive Deepfake Embeddings (CoDE) is a novel embedding space specifically designed for deepfake detection.
CoDE is trained via contrastive learning by additionally enforcing global-local similarities.
arXiv Detail & Related papers (2024-07-29T18:00:10Z)
- EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition [6.996304653818122]
We present an effective approach to harness the potential of a foundation model for Visual Place Recognition.
We show that features extracted from self-attention layers can act as a powerful re-ranker for VPR, even in a zero-shot setting.
Our method also demonstrates exceptional robustness and generalization, setting new state-of-the-art performance.
arXiv Detail & Related papers (2024-05-28T11:24:41Z)
- Learning Neural Volumetric Pose Features for Camera Localization [47.06118952014523]
We introduce a novel neural volumetric pose feature, termed PoseMap, to enhance camera localization.
Our framework leverages an Absolute Pose Regression (APR) architecture, together with an augmented NeRF module.
We demonstrate that our method achieves 14.28% and 20.51% performance gains on average in indoor and outdoor benchmark scenes, respectively.
arXiv Detail & Related papers (2024-03-19T15:01:18Z)
- DGNet: Dynamic Gradient-Guided Network for Water-Related Optics Image Enhancement [77.0360085530701]
Underwater image enhancement (UIE) is a challenging task due to the complex degradation caused by underwater environments.
Previous methods often idealize the degradation process, and neglect the impact of medium noise and object motion on the distribution of image features.
Our approach utilizes predicted images to dynamically update pseudo-labels, adding a dynamic gradient to optimize the network's gradient space.
arXiv Detail & Related papers (2023-12-12T06:07:21Z)
- ExposureDiffusion: Learning to Expose for Low-light Image Enhancement [87.08496758469835]
This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model.
Our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models.
The proposed framework can work with real-paired datasets, SOTA noise models, and different backbone networks.
arXiv Detail & Related papers (2023-07-15T04:48:35Z)
- GAN-based Image Compression with Improved RDO Process [20.00340507091567]
We present a novel GAN-based image compression approach with improved rate-distortion optimization process.
To achieve this, we utilize the DISTS and MS-SSIM metrics to measure perceptual degeneration in color, texture, and structure.
The proposed method outperforms existing GAN-based methods and the state-of-the-art hybrid codec (i.e., VVC).
arXiv Detail & Related papers (2023-06-18T03:21:11Z)
- Image-specific Convolutional Kernel Modulation for Single Image Super-resolution [85.09413241502209]
To address this issue, we propose a novel image-specific convolutional kernel modulation (IKM) method.
We exploit the global contextual information of image or feature to generate an attention weight for adaptively modulating the convolutional kernels.
Experiments on single image super-resolution show that the proposed methods achieve superior performances over state-of-the-art methods.
arXiv Detail & Related papers (2021-11-16T11:05:10Z)
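The Top-1 accuracy figures quoted in the abstract and in several of the VPR summaries above come from a standard nearest-neighbor retrieval protocol: each query image is matched to its most similar database descriptor, and the match counts as correct if it corresponds to the query's true place. A minimal sketch of that evaluation follows; the descriptors and ground-truth labels are made up, and real benchmarks (Tokyo24/7, Nordland, SF-XL) judge correctness by a geographic distance threshold rather than exact labels.

```python
import numpy as np

def top1_accuracy(query_desc, db_desc, query_gt, db_gt):
    """Fraction of queries whose nearest database descriptor (cosine
    similarity on L2-normalized vectors) shares the query's place label.
    Labels stand in for the distance-threshold check used by real benchmarks.
    """
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    nearest = np.argmax(q @ d.T, axis=1)   # index of best match per query
    hits = db_gt[nearest] == query_gt      # Top-1 match correct?
    return hits.mean()

# Made-up example: two places, database descriptors clustered by place.
db_desc = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
db_gt = np.array([0, 0, 1, 1])
query_desc = np.array([[0.95, 0.05], [0.05, 0.95]])
query_gt = np.array([0, 1])
acc = top1_accuracy(query_desc, db_desc, query_gt, db_gt)
print(acc)  # 1.0
```

In practice the database holds millions of descriptors, so the brute-force `q @ d.T` similarity matrix is replaced by an approximate nearest-neighbor index.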
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.