Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes
- URL: http://arxiv.org/abs/2303.04249v1
- Date: Tue, 7 Mar 2023 21:47:58 GMT
- Title: Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes
- Authors: Brandon Clark, Alec Kerrigan, Parth Parag Kulkarni, Vicente Vivanco Cepeda, Mubarak Shah
- Abstract summary: We introduce an end-to-end transformer-based architecture that exploits the relationship between different geographic levels.
We achieve state-of-the-art street-level accuracy on four standard geo-localization datasets.
- Score: 53.53712888703834
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Determining the exact latitude and longitude at which a photo was
taken is a useful and widely applicable task, yet it remains exceptionally difficult
despite the accelerated progress of other computer vision tasks. Most previous
approaches have opted to learn a single representation of query images, which
are then classified at different levels of geographic granularity. These
approaches fail to exploit the different visual cues that give context to
different hierarchies, such as the country, state, and city level. To this end,
we introduce an end-to-end transformer-based architecture that exploits the
relationship between different geographic levels (which we refer to as
hierarchies) and the corresponding visual scene information in an image through
hierarchical cross-attention. We achieve this by learning a query for each
geographic hierarchy and scene type. Furthermore, we learn a separate
representation for different environmental scenes, as different scenes in the
same location are often defined by completely different visual features. We
achieve state-of-the-art street-level accuracy on four standard
geo-localization datasets: Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k, and we
qualitatively demonstrate how our method learns different representations for
different visual hierarchies and scenes, which previous methods have not shown.
These existing test datasets mostly consist of iconic landmarks or images taken
from social media, which makes them either a memorization task or biased
towards certain places. To address this issue, we introduce a much harder test
dataset, Google-World-Streets-15k, composed of images taken from Google Street
View covering the whole planet, and we present state-of-the-art results on it.
Our code will be made available in the camera-ready version.
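The abstract describes learning one query per geographic hierarchy and per scene type and letting those queries attend to image features through hierarchical cross-attention. The snippet below is a minimal PyTorch sketch of that idea only; it is not the authors' implementation, and the module names, dimensions, number of scenes, and per-hierarchy cell counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalQueryDecoder(nn.Module):
    """Minimal sketch: one learned query per geographic hierarchy and per
    scene type cross-attends to image tokens from a vision backbone."""

    def __init__(self, dim=768, num_hierarchies=3, num_scenes=16, num_heads=8,
                 cells_per_hierarchy=(200, 3000, 12000)):  # illustrative cell counts
        super().__init__()
        assert len(cells_per_hierarchy) == num_hierarchies
        # Learned queries: one embedding per hierarchy (e.g. country/state/city)
        # and one per scene type (e.g. indoor, urban, natural).
        self.hierarchy_queries = nn.Parameter(torch.randn(num_hierarchies, dim) * 0.02)
        self.scene_queries = nn.Parameter(torch.randn(num_scenes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One classification head per hierarchy, over that hierarchy's geographic cells.
        self.heads = nn.ModuleList(nn.Linear(dim, n) for n in cells_per_hierarchy)

    def forward(self, image_tokens):
        # image_tokens: (B, N, dim) patch features from a ViT-style encoder.
        b = image_tokens.size(0)
        queries = torch.cat([self.hierarchy_queries, self.scene_queries], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)          # (B, H + S, dim)
        attended, _ = self.cross_attn(queries, image_tokens, image_tokens)
        h = self.hierarchy_queries.size(0)
        hierarchy_feats = attended[:, :h]   # one representation per geographic level
        scene_feats = attended[:, h:]       # one representation per scene type
        logits = [head(hierarchy_feats[:, i]) for i, head in enumerate(self.heads)]
        return logits, scene_feats


# Dummy usage: two images, 196 patch tokens each.
decoder = HierarchicalQueryDecoder()
logits, scene_feats = decoder(torch.randn(2, 196, 768))
```

In this sketch each hierarchy query yields its own representation and classification head, which is what allows country-, state-, and city-level predictions to rely on different visual cues.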
Related papers
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models [40.69217368870192]
We propose a novel framework for worldwide geolocalization based on Retrieval-Augmented Generation (RAG).
G3 consists of three steps: Geo-alignment, Geo-diversification, and Geo-verification.
Experiments on two well-established datasets verify the superiority of G3 compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-05-23T15:37:06Z)
- GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization [61.10806364001535]
Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth.
Existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task.
We propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations (a minimal sketch of this alignment idea appears after this list).
arXiv Detail & Related papers (2023-09-27T20:54:56Z)
- Are Local Features All You Need for Cross-Domain Visual Place Recognition? [13.519413608607781]
Visual Place Recognition aims to predict the coordinates of an image based solely on visual clues.
Despite recent advances, recognizing the same place when the query comes from a significantly different distribution is still a major hurdle for state-of-the-art retrieval methods.
In this work we explore whether re-ranking methods based on spatial verification can tackle these challenges.
arXiv Detail & Related papers (2023-04-12T14:46:57Z)
- G^3: Geolocation via Guidebook Grounding [92.46774241823562]
We study explicit knowledge from human-written guidebooks that describe the salient and class-discriminative visual features humans use for geolocation.
We propose the task of Geolocation via Guidebook Grounding, which uses a dataset of StreetView images from a diverse set of locations.
Our approach substantially outperforms a state-of-the-art image-only geolocation method, with an improvement of over 5% in Top-1 accuracy.
arXiv Detail & Related papers (2022-11-28T16:34:40Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potential of vision transformers (ViTs) for dense visual prediction.
Our motivation is that by learning global context over the full receptive field, layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Where in the World is this Image? Transformer-based Geo-localization in the Wild [48.69031054573838]
Predicting the geographic location (geo-localization) from a single ground-level RGB image taken anywhere in the world is a very challenging problem.
We propose TransLocator, a unified dual-branch transformer network that attends to tiny details over the entire image.
We evaluate TransLocator on four benchmark datasets (Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k) and obtain continent-level accuracy improvements of 5.5%, 14.1%, 4.9%, and 9.9%, respectively.
arXiv Detail & Related papers (2022-04-29T03:27:23Z)
- Location Sensitive Image Retrieval and Tagging [10.832389603397603]
We present LocSens, a model that learns to rank triplets of images, tags, and coordinates by plausibility, and two training strategies to balance the location influence in the final ranking.
arXiv Detail & Related papers (2020-07-07T12:09:01Z)
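The GeoCLIP entry above describes a CLIP-inspired approach that aligns an image with its corresponding GPS location in a shared embedding space. Below is a minimal PyTorch sketch of that general contrastive image-to-GPS alignment idea, not the GeoCLIP implementation; the MLP location encoder, the projection dimensions, and the symmetric InfoNCE loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageGPSAlignment(nn.Module):
    """Sketch of CLIP-style contrastive alignment between pooled image features
    and GPS coordinates; encoders and dimensions are illustrative assumptions."""

    def __init__(self, img_feat_dim=768, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)
        # Stand-in GPS encoder: maps (lat, lon) into the shared embedding space.
        self.gps_encoder = nn.Sequential(
            nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ log(1/0.07), as in CLIP

    def forward(self, img_feats, gps):
        # img_feats: (B, img_feat_dim) pooled image features; gps: (B, 2) lat/lon pairs.
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        loc = F.normalize(self.gps_encoder(gps), dim=-1)
        logits = self.logit_scale.exp() * img @ loc.t()       # (B, B) similarity matrix
        targets = torch.arange(img.size(0), device=img.device)
        # Symmetric InfoNCE: each image matches its own GPS location and vice versa.
        loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
        return loss


# Dummy usage: a batch of 8 image features and their coordinates.
model = ImageGPSAlignment()
loss = model(torch.randn(8, 768), torch.randn(8, 2))
```

With encoders trained this way, localization at test time would reduce to retrieving, from a gallery of candidate coordinates, the GPS embedding most similar to the query image embedding.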