Distillation Improves Visual Place Recognition for Low-Quality Queries
- URL: http://arxiv.org/abs/2310.06906v1
- Date: Tue, 10 Oct 2023 18:03:29 GMT
- Title: Distillation Improves Visual Place Recognition for Low-Quality Queries
- Authors: Anbang Yang, Yao Wang, John-Ross Rizzo, Chen Feng
- Abstract summary: Streaming query images/videos to a server for visual place recognition can result in reduced resolution or increased quantization.
We present a method that uses high-quality queries only during training to distill better feature representations for deep-learning-based VPR.
We achieve notable VPR recall-rate improvements over low-quality queries, as demonstrated in our experimental results.
- Score: 11.383202263053379
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The shift to online computing for real-time visual localization often
requires streaming query images/videos to a server for visual place recognition
(VPR), where fast video transmission may result in reduced resolution or
increased quantization. This compromises the quality of global image
descriptors, leading to decreased VPR performance. To improve the low recall
rate for low-quality query images, we present a simple yet effective method
that uses high-quality queries only during training to distill better feature
representations for deep-learning-based VPR, such as NetVLAD. Specifically, we
use mean squared error (MSE) loss between the global descriptors of queries
with different qualities, and inter-channel correlation knowledge distillation
(ICKD) loss over their corresponding intermediate features. We validate our
approach using both the Pittsburgh 250k dataset and our own indoor dataset with
varying quantization levels. By fine-tuning NetVLAD parameters with our
distillation-augmented losses, we achieve notable VPR recall-rate improvements
over low-quality queries, as demonstrated in our extensive experimental
results. We believe this work not only pushes forward the VPR research but also
provides valuable insights for applications needing dependable place
recognition under resource-limited conditions.
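As a concrete illustration, below is a minimal PyTorch sketch of the two distillation terms named in the abstract: an MSE loss between the global descriptors of the high- and low-quality branches, plus an ICKD-style loss that matches the inter-channel correlation matrices of intermediate features. The layer choice and the weighting `lambda_ickd` are illustrative assumptions, not the paper's released configuration.

```python
import torch
import torch.nn.functional as F

def descriptor_mse_loss(student_desc, teacher_desc):
    """MSE between the global descriptors of the low-quality (student)
    and high-quality (teacher) queries."""
    return F.mse_loss(student_desc, teacher_desc)

def inter_channel_correlation(feat):
    """Channel-wise correlation (Gram) matrix of an intermediate
    feature map, as used by ICKD-style losses.
    feat: (B, C, H, W) -> (B, C, C)"""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    f = F.normalize(f, dim=2)               # normalize each channel vector
    return torch.bmm(f, f.transpose(1, 2))  # (B, C, C)

def ickd_loss(student_feat, teacher_feat):
    """Match the inter-channel correlation matrices of the two branches."""
    return F.mse_loss(inter_channel_correlation(student_feat),
                      inter_channel_correlation(teacher_feat))

def distillation_loss(student_desc, teacher_desc,
                      student_feat, teacher_feat, lambda_ickd=1.0):
    # lambda_ickd is a hypothetical weighting; the paper's value may differ.
    return (descriptor_mse_loss(student_desc, teacher_desc)
            + lambda_ickd * ickd_loss(student_feat, teacher_feat))
```

In a training loop of this kind, the teacher branch would run frozen NetVLAD weights on the high-quality query while the student branch, fed the compressed query, is fine-tuned against this combined loss.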
Related papers
- DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild [54.139923409101044]
We propose a novel IQA method called diffusion priors-based IQA (DP-IQA).
We use pre-trained stable diffusion as the backbone, extract multi-level features from the denoising U-Net, and decode them to estimate the image quality score.
We distill the knowledge in the above model into a CNN-based student model, significantly reducing the parameter count to enhance applicability.
arXiv Detail & Related papers (2024-05-30T12:32:35Z) - EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition [6.996304653818122]
We propose a simple yet powerful approach to better exploit the potential of a foundation model for Visual Place Recognition.
We first demonstrate that features extracted from self-attention layers can serve as a powerful re-ranker for VPR.
We then demonstrate that a single-stage method leveraging internal ViT layers for pooling can generate global features that achieve state-of-the-art results.
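As a rough sketch of the pooling idea in the EffoVPR summary above, the snippet below hooks an internal ViT encoder block and mean-pools its patch tokens into an L2-normalized global descriptor. The backbone (torchvision's vit_b_16), the layer index, and mean pooling are illustrative assumptions rather than EffoVPR's actual design.

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Illustrative: grab token embeddings from an internal encoder block
# via a forward hook, then pool them into one global descriptor.
model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT).eval()

captured = {}
def hook(module, inputs, output):
    captured["tokens"] = output  # (B, 1 + num_patches, hidden_dim)

# The layer index is an assumption; which layers work best is an
# empirical question studied by methods like EffoVPR.
model.encoder.layers[-2].register_forward_hook(hook)

with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))

patch_tokens = captured["tokens"][:, 1:, :]   # drop the class token
global_desc = torch.nn.functional.normalize(
    patch_tokens.mean(dim=1), dim=-1)         # simple mean pooling
print(global_desc.shape)                      # torch.Size([1, 768])
```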
arXiv Detail & Related papers (2024-05-28T11:24:41Z) - Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment [49.36799270585947]
No-reference point cloud quality assessment (NR-PCQA) aims to automatically evaluate the perceptual quality of distorted point clouds without available reference.
We propose a novel contrastive pre-training framework tailored for PCQA (CoPA).
Our method outperforms the state-of-the-art PCQA methods on popular benchmarks.
arXiv Detail & Related papers (2024-03-15T07:16:07Z) - Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering [7.640416680391081]
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance.
We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection.
To mitigate the challenges associated with evaluating free-form open-ended VQA responses, we introduce a straightforward LLM-guided pre-processing technique.
arXiv Detail & Related papers (2023-06-16T17:47:57Z) - Neighbourhood Representative Sampling for Efficient End-to-end Video Quality Assessment [60.57703721744873]
The increased resolution of real-world videos presents a dilemma between efficiency and accuracy for deep Video Quality Assessment (VQA).
In this work, we propose a unified scheme, spatial-temporal grid mini-cube sampling (St-GMS) to get a novel type of sample, named fragments.
With fragments and FANet, the proposed efficient end-to-end FAST-VQA and FasterVQA achieve significantly better performance than existing approaches on all VQA benchmarks.
arXiv Detail & Related papers (2022-10-11T11:38:07Z) - FAST-VQA: Efficient End-to-end Video Quality Assessment with Fragment Sampling [54.31355080688127]
Current deep video quality assessment (VQA) methods are usually with high computational costs when evaluating high-resolution videos.
We propose Grid Mini-patch Sampling (GMS), which allows consideration of local quality by sampling patches at their raw resolution.
We build the Fragment Attention Network (FANet) specially designed to accommodate fragments as inputs.
FAST-VQA improves state-of-the-art accuracy by around 10% while reducing FLOPs by 99.5% on 1080P high-resolution videos.
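To make the fragment idea concrete, here is a hedged sketch of grid mini-patch sampling: one raw-resolution mini-patch is cropped from each cell of a uniform spatial grid and the patches are stitched into a small fragment frame. The grid and patch sizes below are illustrative, not the paper's settings.

```python
import torch

def grid_mini_patch_sample(frame, grid=7, patch=32):
    """Crop one raw-resolution mini-patch from each cell of a
    grid x grid partition and stitch them into a fragment.
    frame: (C, H, W) -> (C, grid*patch, grid*patch)"""
    c, h, w = frame.shape
    cell_h, cell_w = h // grid, w // grid
    rows = []
    for i in range(grid):
        cols = []
        for j in range(grid):
            # random top-left corner inside the (i, j) grid cell
            y = i * cell_h + torch.randint(0, max(cell_h - patch, 1), (1,)).item()
            x = j * cell_w + torch.randint(0, max(cell_w - patch, 1), (1,)).item()
            cols.append(frame[:, y:y + patch, x:x + patch])
        rows.append(torch.cat(cols, dim=2))  # stitch along width
    return torch.cat(rows, dim=1)            # stitch along height

fragment = grid_mini_patch_sample(torch.randn(3, 1080, 1920))
print(fragment.shape)  # torch.Size([3, 224, 224])
```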
arXiv Detail & Related papers (2022-07-06T11:11:43Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
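For context, below is a generic NT-Xent contrastive loss of the kind self-supervised representation learners such as CONVIQT build on; the temperature and pairing scheme are standard assumptions, not CONVIQT's exact formulation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss: z1[i] and z2[i] are embeddings of two
    views of the same clip; all other pairs act as negatives.
    z1, z2: (N, D)"""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D)
    sim = z @ z.t() / temperature                       # (2N, 2N) similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    # the positive for row i is its counterpart in the other view
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```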
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - Learning Transformer Features for Image Quality Assessment [53.51379676690971]
We propose a unified IQA framework that utilizes a CNN backbone and a transformer encoder to extract features.
The proposed framework is compatible with both FR and NR modes and allows for a joint training scheme.
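A hedged sketch of the stated design follows: a CNN backbone produces spatial tokens that a transformer encoder refines before a regression head predicts the quality score. The specific backbone, dimensions, and pooling are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class CNNTransformerIQA(nn.Module):
    """CNN backbone -> spatial tokens -> transformer encoder -> score."""
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        # drop the average-pool and FC head; keep the conv feature map
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)  # scalar quality score

    def forward(self, x):
        f = self.proj(self.features(x))        # (B, d, h, w)
        tokens = f.flatten(2).transpose(1, 2)  # (B, h*w, d) spatial tokens
        tokens = self.encoder(tokens)
        return self.head(tokens.mean(dim=1))   # pool tokens -> score

score = CNNTransformerIQA()(torch.randn(1, 3, 224, 224))
```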
arXiv Detail & Related papers (2021-12-01T13:23:00Z)