NYC-Indoor-VPR: A Long-Term Indoor Visual Place Recognition Dataset with Semi-Automatic Annotation
- URL: http://arxiv.org/abs/2404.00504v1
- Date: Sun, 31 Mar 2024 00:20:53 GMT
- Title: NYC-Indoor-VPR: A Long-Term Indoor Visual Place Recognition Dataset with Semi-Automatic Annotation
- Authors: Diwei Sheng, Anbang Yang, John-Ross Rizzo, Chen Feng
- Abstract summary: This paper introduces the NYC-Indoor-VPR dataset, a unique and rich collection of over 36,000 images compiled from 13 distinct crowded scenes in New York City.
To establish the ground truth for VPR, we propose a semi-automatic annotation approach that computes the positional information of each image.
Our method specifically takes pairs of videos as input and yields matched pairs of images along with their estimated relative locations.
- Score: 7.037667953803237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Place Recognition (VPR) in indoor environments is beneficial to humans and robots for better localization and navigation. It is challenging due to appearance changes at various frequencies and the difficulty of obtaining ground-truth metric trajectories for training and evaluation. This paper introduces the NYC-Indoor-VPR dataset, a unique and rich collection of over 36,000 images compiled from 13 distinct crowded scenes in New York City, taken under varying lighting conditions and with appearance changes. Each scene has multiple revisits across a year. To establish the ground truth for VPR, we propose a semi-automatic annotation approach that computes the positional information of each image. Our method takes pairs of videos as input and yields matched pairs of images along with their estimated relative locations. The accuracy of this matching is refined by human annotators, who use our annotation software to correlate the selected keyframes. Finally, we present a benchmark evaluation of several state-of-the-art VPR algorithms on our annotated dataset, revealing its difficulty and thus its value for VPR research.
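As a rough illustration of the automatic step of such an annotation pipeline (a pair of videos in, matched image pairs out), the minimal sketch below pairs keyframes of two revisits by mutual nearest-neighbor search over per-frame global descriptors. This is not the authors' implementation: the descriptor source, the mutual nearest-neighbor rule, and all names are illustrative assumptions.

# Minimal sketch of cross-video keyframe matching for annotation bootstrapping.
# NOT the NYC-Indoor-VPR tooling; the descriptors, the mutual-NN rule, and all
# names here are assumptions made for illustration only.
import numpy as np

def match_video_pair(desc_a, desc_b):
    """desc_a: (Na, D) L2-normalized global descriptors of keyframes from revisit A.
    desc_b: (Nb, D) descriptors of keyframes from revisit B.
    Returns (i, j) pairs where frame i of A and frame j of B are mutual
    nearest neighbors in descriptor space."""
    sims = desc_a @ desc_b.T                  # cosine similarities, shape (Na, Nb)
    best_b_for_a = sims.argmax(axis=1)        # closest B frame for each A frame
    best_a_for_b = sims.argmax(axis=0)        # closest A frame for each B frame
    return [(i, j) for i, j in enumerate(best_b_for_a)
            if best_a_for_b[j] == i]          # keep only mutual agreements

In the paper, the accuracy of the automatically proposed matches is refined by human annotators, who use the provided annotation software to correlate the selected keyframes.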
Related papers
- Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition [9.192660643226372]
We formulate how limitations in the Geographic Distance Sensitivity of current VPR embeddings result in a high probability of incorrectly sorting the top-k retrievals.
We propose a novel mining strategy, CliqueMining, that selects positive and negative examples by sampling cliques from a graph of visually similar images.
Our approach boosts the sensitivity of VPR embeddings at small distance ranges, significantly improving the state of the art on relevant benchmarks.
arXiv Detail & Related papers (2024-07-02T16:49:01Z)
- Breaking the Frame: Visual Place Recognition by Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach based on overlap prediction, called VOP.
VOP identifies co-visible image sections by obtaining patch-level embeddings using a Vision Transformer backbone.
Our approach uses a voting mechanism to assess overlap scores for potential database images.
arXiv Detail & Related papers (2024-06-23T20:00:20Z)
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
- Deep Homography Estimation for Visual Place Recognition [49.235432979736395]
We propose a transformer-based deep homography estimation (DHE) network.
It takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification.
Experiments on benchmark datasets show that our method can outperform several state-of-the-art methods.
arXiv Detail & Related papers (2024-02-25T13:22:17Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- CoPR: Towards Accurate Visual Localization With Continuous Place-descriptor Regression [2.7393821783237184]
Visual Place Recognition (VPR) estimates the camera location of a query image by retrieving the most similar reference image from a map of geo-tagged reference images.
References for VPR are only available at sparse poses in a map, which enforces an upper bound on the maximum achievable localization accuracy.
We propose Continuous Place-descriptor Regression (CoPR) to densify the map and improve localization accuracy.
arXiv Detail & Related papers (2023-04-14T23:17:44Z)
- Self-Supervised Place Recognition by Refining Temporal and Featural Pseudo Labels from Panoramic Data [16.540900776820084]
We propose a novel framework named TF-VPR that uses temporal neighborhoods and learnable feature neighborhoods to discover unknown spatial neighborhoods.
Our method outperforms self-supervised baselines in recall rate, robustness, and heading diversity.
arXiv Detail & Related papers (2022-08-19T12:59:46Z)
- NYU-VPR: Long-Term Visual Place Recognition Benchmark with View Direction and Data Anonymization Influences [5.94860356161563]
We present a dataset of more than 200,000 images covering a 2 km by 2 km area near the New York University campus, taken throughout 2016.
We show that side views are significantly more challenging for current VPR methods while the influence of data anonymization is almost negligible.
arXiv Detail & Related papers (2021-10-18T03:56:33Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- VPR-Bench: An Open-Source Visual Place Recognition Evaluation Framework with Quantifiable Viewpoint and Appearance Change [25.853640977526705]
VPR research has grown rapidly as a field over the past decade due to improving camera hardware and its potential for deep learning-based techniques.
This growth has led to fragmentation and a lack of standardisation in the field, especially concerning performance evaluation.
In this paper, we address these gaps through a new comprehensive open-source framework for assessing the performance of VPR techniques, dubbed "VPR-Bench".
arXiv Detail & Related papers (2020-05-17T00:27:53Z)
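Benchmark evaluations of retrieval-based VPR, such as the one reported for NYC-Indoor-VPR above, are commonly summarized with Recall@N: a query counts as correct if any of its top-N retrieved reference images lies within a distance threshold of the query's ground-truth position. The sketch below is a minimal, generic implementation of that metric; the function name, array layout, and the 5-meter threshold are assumptions, not taken from any of the papers listed here.

# Generic Recall@N for VPR with positional ground truth (illustrative sketch only).
import numpy as np

def recall_at_n(query_desc, ref_desc, query_pos, ref_pos,
                n_values=(1, 5, 10), dist_threshold=5.0):
    """query_desc: (Q, D) L2-normalized global descriptors of query images.
    ref_desc:   (R, D) descriptors of reference (database) images.
    query_pos:  (Q, 2) ground-truth query positions (same units as the threshold).
    ref_pos:    (R, 2) ground-truth reference positions.
    Returns {n: recall}, where a query is a hit if any of its top-n
    retrievals lies within dist_threshold of the query position."""
    sims = query_desc @ ref_desc.T                       # (Q, R) similarities
    order = np.argsort(-sims, axis=1)                    # best match first
    geo = np.linalg.norm(query_pos[:, None] - ref_pos[None], axis=2)  # (Q, R) distances
    recalls = {}
    for n in n_values:
        top_n = order[:, :n]                             # indices of top-n retrievals
        hits = np.take_along_axis(geo, top_n, axis=1) <= dist_threshold
        recalls[n] = float(hits.any(axis=1).mean())      # fraction of queries with a hit
    return recalls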
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.