CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images
- URL: http://arxiv.org/abs/2506.12214v1
- Date: Fri, 13 Jun 2025 20:32:58 GMT
- Title: CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images
- Authors: Ilya Ilyankou, Natchapon Jongwiriyanurak, Tao Cheng, James Haworth
- Abstract summary: We present a CLIP-based, multi-modal, multi-label classifier for predicting geographical context tags from landscape photos. We show that combining location and title embeddings with image features improves accuracy over using image embeddings alone.
- Score: 0.5999777817331317
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present a CLIP-based, multi-modal, multi-label classifier for predicting geographical context tags from landscape photos in the Geograph dataset, a crowdsourced image archive spanning the British Isles, including remote regions that lack POIs and street-level imagery. Our approach addresses a Kaggle competition task (https://www.kaggle.com/competitions/predict-geographic-context-from-landscape-photos) based on a subset of Geograph's 8M images, with a strict evaluation criterion: exact-match accuracy across 49 possible tags. We show that combining location and title embeddings with image features improves accuracy over using image embeddings alone. We release a lightweight pipeline (https://github.com/SpaceTimeLab/ClipTheLandscape) that trains on a modest laptop, using pre-trained CLIP image and text embeddings and a simple classification head. Predicted tags can support downstream tasks such as building location embedders for GeoAI applications, enriching spatial understanding in data-sparse regions.
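The pipeline described above is compact enough to sketch end to end: frozen pre-trained CLIP embeddings for the photo and its title, a location embedding, and a simple classification head that emits one logit per tag. Below is a minimal PyTorch sketch of that architecture; the hidden size, the two-layer head, the MLP location encoder, and the 0.5 decision threshold are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class LandscapeTagger(nn.Module):
    """Multi-label head over concatenated CLIP image, title, and location features.

    Minimal sketch: the paper reports pre-trained CLIP image/text embeddings
    plus a simple classification head; the layer sizes and the location
    encoding used here are illustrative assumptions.
    """

    def __init__(self, clip_dim: int = 512, loc_dim: int = 64,
                 hidden: int = 512, num_tags: int = 49):
        super().__init__()
        # Tiny MLP lifting (lat, lon) into loc_dim; a stand-in for whatever
        # location encoding the released pipeline actually uses.
        self.loc_encoder = nn.Sequential(nn.Linear(2, loc_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * clip_dim + loc_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_tags),  # one logit per tag
        )

    def forward(self, img_emb, title_emb, latlon):
        feats = torch.cat([img_emb, title_emb, self.loc_encoder(latlon)], dim=-1)
        return self.head(feats)  # raw logits; train with nn.BCEWithLogitsLoss

def exact_match(logits, targets, threshold=0.5):
    """Strict competition-style scoring: a prediction counts only if all
    49 tags agree with the ground truth."""
    preds = (torch.sigmoid(logits) > threshold).float()
    return (preds == targets).all(dim=-1).float().mean().item()
```

Training such a head over pre-computed CLIP embeddings is cheap, which is consistent with the paper's claim that the pipeline trains on a modest laptop.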
Related papers
- GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization [61.10806364001535]
Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth.
Existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task.
We propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations.
arXiv Detail & Related papers (2023-09-27T20:54:56Z)
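As summarized above, GeoCLIP casts geo-localization as image-to-GPS retrieval: a location encoder maps GPS coordinates into the image-embedding space, and a query image is matched against a gallery of candidate locations. A minimal sketch of the retrieval step, assuming a placeholder MLP location encoder (GeoCLIP's actual encoder, built on random Fourier features, is more elaborate):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPSEncoder(nn.Module):
    """Placeholder MLP mapping (lat, lon) into the image-embedding space."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, latlon):
        return F.normalize(self.net(latlon), dim=-1)

@torch.no_grad()
def localize(img_emb, gps_gallery, gps_encoder, topk=1):
    """Image-to-GPS retrieval: score a query image embedding against
    candidate GPS points and return the best-matching coordinates."""
    query = F.normalize(img_emb, dim=-1)        # (D,)
    gallery_emb = gps_encoder(gps_gallery)      # (N, D)
    scores = gallery_emb @ query                # (N,) cosine similarities
    best = scores.topk(topk).indices
    return gps_gallery[best]                    # (topk, 2) lat/lon pairs
```

Because the gallery is just a set of coordinates, retrieval avoids committing to a fixed grid of discrete geographic cells, which is the contrast the summary draws with prior classification-based approaches.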
- Top-Down Framework for Weakly-supervised Grounded Image Captioning [19.00510117145054]
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z)
- CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations [90.50864830038202]
We present Contrastive Spatial Pre-Training (CSP), a self-supervised learning framework for geo-tagged images.
We use a dual-encoder to separately encode the images and their corresponding geo-locations, and use contrastive objectives to learn effective location representations from images.
CSP significantly boosts model performance, with 10-34% relative improvement across a range of labeled-training-data sampling ratios.
arXiv Detail & Related papers (2023-05-01T23:11:18Z)
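The dual-encoder design summarized above encodes images and geo-locations separately and ties them together with a contrastive objective, so that matching (image, location) pairs agree. A minimal sketch of one pre-training step, assuming a symmetric InfoNCE-style loss and a placeholder location encoder; CSP's actual objectives and encoders differ in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationEncoder(nn.Module):
    """Placeholder geo-location encoder for the sketch."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, latlon):
        return F.normalize(self.net(latlon), dim=-1)

def contrastive_step(img_encoder, loc_encoder, images, latlons, tau=0.1):
    """Symmetric InfoNCE: each image should match its own geo-location
    and vice versa, with all other pairs in the batch as negatives."""
    z_img = F.normalize(img_encoder(images), dim=-1)   # (B, D)
    z_loc = loc_encoder(latlons)                       # (B, D)
    logits = z_img @ z_loc.T / tau                     # (B, B)
    labels = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```

After pre-training, the image encoder carries location-aware features that can be fine-tuned with relatively little labeled data, which matches the varying labeled-sampling-ratio evaluations reported above.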
- G^3: Geolocation via Guidebook Grounding [92.46774241823562]
We study explicit knowledge from human-written guidebooks that describe the salient and class-discriminative visual features humans use for geolocation.
We propose the task of Geolocation via Guidebook Grounding that uses a dataset of StreetView images from a diverse set of locations.
Our approach substantially outperforms a state-of-the-art image-only geolocation method, with an improvement of over 5% in Top-1 accuracy.
arXiv Detail & Related papers (2022-11-28T16:34:40Z)
- GAMa: Cross-view Video Geo-localization [68.33955764543465]
We focus on ground-level videos instead of images, which provide additional contextual cues.
At the clip level, a short video clip is matched with its corresponding aerial image; these clip-level matches are then aggregated to geo-localize a long video at the video level.
Our proposed method achieves Top-1 recall rates of 19.4% and 45.1% @1.0 mile.
arXiv Detail & Related papers (2022-07-06T04:25:51Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
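RegionCLIP's contribution, per the summary above, is moving CLIP from whole-image to region-level representations. The toy sketch below conveys only the region-text matching setup: it crops candidate regions and scores each against concept prompts with off-the-shelf CLIP through the Hugging Face transformers API. RegionCLIP itself pretrains region-level representations rather than cropping at inference time, so this illustrates the problem, not the method.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_regions(image: Image.Image, boxes, concepts):
    """Score each candidate region against each concept prompt.

    boxes: iterable of (left, top, right, bottom) pixel coordinates.
    Returns a (num_regions, num_concepts) matrix of softmax scores.
    """
    crops = [image.crop(box) for box in boxes]
    prompts = [f"a photo of a {c}" for c in concepts]
    inputs = processor(text=prompts, images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.softmax(dim=-1)
```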
- Detector-Free Weakly Supervised Grounding by Separation [76.65699170882036]
Weakly Supervised phrase-Grounding (WSG) deals with the task of using data to learn to localize arbitrary text phrases in images.
We propose Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector.
We demonstrate a significant accuracy improvement of up to 8.5% over the previous DF-WSG state of the art.
arXiv Detail & Related papers (2021-04-20T08:27:31Z)
- Hierarchical Attention Fusion for Geo-Localization [7.544917072241684]
We introduce a hierarchical attention fusion network using multi-scale features for geo-localization.
We extract hierarchical feature maps from a convolutional neural network (CNN) and fuse the extracted features into image representations.
Training is self-supervised, with adaptive weights controlling how much emphasis the attention gives to features from each hierarchical level.
arXiv Detail & Related papers (2021-02-18T07:07:03Z)
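The entry above combines two ideas: hierarchical (multi-scale) CNN feature maps and adaptive weights that control how much each level contributes to the fused representation. A minimal PyTorch sketch of that fusion pattern, using a ResNet-18 backbone and softmax-normalized per-level weights; the backbone, pooling, and weighting scheme are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class HierarchicalAttentionFusion(nn.Module):
    """Fuses multi-scale CNN features with learned per-level attention weights."""

    def __init__(self, out_dim: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.levels = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        dims = [64, 128, 256, 512]  # channel widths of the four ResNet-18 stages
        self.proj = nn.ModuleList([nn.Linear(d, out_dim) for d in dims])
        # One learnable attention logit per hierarchical level.
        self.level_logits = nn.Parameter(torch.zeros(len(dims)))

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for level, proj in zip(self.levels, self.proj):
            x = level(x)
            pooled = F.adaptive_avg_pool2d(x, 1).flatten(1)  # (B, C)
            feats.append(proj(pooled))                       # (B, out_dim)
        weights = torch.softmax(self.level_logits, dim=0)    # adaptive emphasis
        return sum(w * f for w, f in zip(weights, feats))    # fused descriptor
```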
This list is automatically generated from the titles and abstracts of the papers on this site.