Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images
- URL: http://arxiv.org/abs/2307.15904v2
- Date: Thu, 11 Apr 2024 22:39:15 GMT
- Title: Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images
- Authors: Aayush Dhakal, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Hannah Kerner, Nathan Jacobs
- Abstract summary: We propose a weakly supervised approach for creating maps using free-form textual descriptions.
We train a contrastive learning framework called Sat2Cap on a new large-scale dataset with 6.1M pairs of overhead and ground-level images.
- Score: 12.356676398446215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a weakly supervised approach for creating maps using free-form textual descriptions. We refer to this task of creating textual maps as zero-shot mapping. Prior works have approached mapping tasks by developing models that predict a fixed set of attributes using overhead imagery. However, these models are very restrictive as they can only solve highly specific tasks for which they were trained. Mapping text, on the other hand, allows us to solve a large variety of mapping problems with minimal restrictions. To achieve this, we train a contrastive learning framework called Sat2Cap on a new large-scale dataset with 6.1M pairs of overhead and ground-level images. For a given location and overhead image, our model predicts the expected CLIP embeddings of the ground-level scenery. The predicted CLIP embeddings are then used to learn about the textual space associated with that location. Sat2Cap is also conditioned on date-time information, allowing it to model temporally varying concepts over a location. Our experimental results demonstrate that our models successfully capture ground-level concepts and allow large-scale mapping of fine-grained textual queries. Our approach does not require any text-labeled data, making the training easily scalable. The code, dataset, and models will be made publicly available.
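Since Sat2Cap predicts the CLIP embedding of the expected ground-level scenery, zero-shot mapping reduces to scoring each overhead tile against a free-form text query with the CLIP text encoder. Below is a minimal sketch of that querying step; the `sat2cap_model` object and its `(overhead_images, datetimes)` interface are hypothetical stand-ins for the trained model, while the scoring itself is plain cosine similarity.

```python
import torch
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def score_locations(overhead_images, datetimes, query, sat2cap_model):
    """Rank overhead tiles by how well the predicted ground-level CLIP
    embedding matches a free-form text query."""
    # Hypothetical interface: Sat2Cap predicts the expected ground-level
    # CLIP embedding per tile, conditioned on date-time.
    pred = sat2cap_model(overhead_images, datetimes)   # (N, D)
    pred = pred / pred.norm(dim=-1, keepdim=True)

    tokens = clip.tokenize([query]).to(device)
    text = clip_model.encode_text(tokens).float()      # (1, D)
    text = text / text.norm(dim=-1, keepdim=True)

    return (pred @ text.T).squeeze(-1)                 # cosine score per tile, (N,)
```

The per-tile scores can then be rendered as a heat map over the queried region, which is what turns a text query into a map.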
Related papers
- Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models [15.454856838083511]
Large Language Models (LLMs) have emerged as a tool for robots to generate task plans using common sense reasoning.
Recent works have shifted from explicit maps with fixed semantic classes to implicit open-vocabulary maps.
We propose an explicit text-based map that can represent thousands of semantic classes while easily integrating with LLMs.
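As a rough illustration of what such an explicit text-based map could look like, the sketch below indexes open-vocabulary tags to the map cells where they were observed; the data structure and names are our assumptions, not the paper's implementation.

```python
from collections import defaultdict

class TagMap:
    """Hypothetical explicit text-based map: tag -> set of map-cell ids."""

    def __init__(self):
        self._index = defaultdict(set)

    def add_observation(self, tag: str, cell_id: int) -> None:
        self._index[tag].add(cell_id)

    def locate(self, tag: str) -> set:
        # Cells where `tag` was observed; an LLM can query this directly.
        return self._index.get(tag, set())

    def tags(self) -> list:
        # A compact vocabulary listing, suitable for inclusion in an LLM prompt.
        return sorted(self._index)
```

Because the map is just text tags and cell ids, it stays readable to an LLM without any embedding lookups.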
arXiv Detail & Related papers (2024-09-23T18:26:19Z) - Evaluating Tool-Augmented Agents in Remote Sensing Platforms [1.8434042562191815]
Existing benchmarks assume question-answering input templates over predefined image-text data pairs.
We present GeoLLM-QA, a benchmark designed to capture long sequences of verbal, visual, and click-based actions on a real UI platform.
arXiv Detail & Related papers (2024-04-23T20:37:24Z) - IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks [124.90137528319273]
In this paper, we present IMProv, a generative model that is able to in-context learn visual tasks from multimodal prompts.
We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions.
During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output.
arXiv Detail & Related papers (2023-12-04T09:48:29Z) - UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z) - Exploring the Application of Large-scale Pre-trained Models on Adverse Weather Removal [97.53040662243768]
We propose a CLIP embedding module to make the network handle different weather conditions adaptively.
This module integrates the sample-specific weather prior extracted by the CLIP image encoder with the distribution-specific information learned by a set of parameters.
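One plausible reading of that design, sketched below, attends over a bank of learned (distribution-specific) parameters using the frozen CLIP image embedding as the sample-specific prior, then modulates restoration features with the fused vector; the FiLM-style fusion and all sizes are our assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class WeatherPriorModule(nn.Module):
    def __init__(self, clip_dim=512, num_learned=8, feat_dim=64):
        super().__init__()
        # Distribution-specific information: a small bank of learned vectors.
        self.learned = nn.Parameter(torch.randn(num_learned, clip_dim))
        self.to_scale_shift = nn.Linear(2 * clip_dim, 2 * feat_dim)

    def forward(self, feats, clip_embed):
        # feats: (B, C, H, W) restoration features; clip_embed: (B, clip_dim)
        # from a frozen CLIP image encoder (the sample-specific weather prior).
        attn = torch.softmax(clip_embed @ self.learned.T, dim=-1)  # (B, num_learned)
        prior = attn @ self.learned                                # (B, clip_dim)
        scale, shift = self.to_scale_shift(
            torch.cat([clip_embed, prior], dim=-1)).chunk(2, dim=-1)
        return feats * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```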
arXiv Detail & Related papers (2023-06-15T10:06:13Z) - Is Cross-modal Information Retrieval Possible without Training? [4.616703548353372]
We take a simple mapping, computed via least squares and singular value decomposition (SVD), as a solution to the Procrustes problem.
That is, given information in one modality such as text, the mapping helps us locate a semantically equivalent data item in another modality such as image.
Using off-the-shelf pretrained deep learning models, we experiment with these simple cross-modal mappings on text-to-image and image-to-text retrieval tasks.
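The closed form behind this is standard: for paired embedding matrices X (e.g., text) and Y (e.g., image), the orthogonal map minimizing ||XW - Y||_F comes from a single SVD of X^T Y. A minimal sketch, with retrieval as cosine ranking in the target space:

```python
import numpy as np

def procrustes_map(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimizing ||X @ W - Y||_F for row-paired X, Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def retrieve(query_vec: np.ndarray, W: np.ndarray, gallery: np.ndarray):
    """Map a text embedding into image space and rank the gallery by cosine."""
    q = query_vec @ W
    q = q / np.linalg.norm(q)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))  # nearest image embeddings first
```

Since W has a closed-form solution, no gradient-based training is involved, which is exactly the paper's point.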
arXiv Detail & Related papers (2023-04-20T02:36:18Z) - Sketch-Guided Text-to-Image Diffusion Models [57.12095262189362]
We introduce a universal approach to guide a pretrained text-to-image diffusion model.
Our method does not require training a dedicated model or a specialized encoder for the task.
We focus in particular on the sketch-to-image translation task, revealing a robust and expressive way to generate images.
arXiv Detail & Related papers (2022-11-24T18:45:32Z) - Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching [53.27673119360868]
Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
arXiv Detail & Related papers (2022-01-18T01:13:19Z) - Synthetic Map Generation to Provide Unlimited Training Data for Historical Map Text Detection [5.872532529455414]
We propose a method to automatically generate an unlimited amount of annotated historical map images for training text detection models.
We show that state-of-the-art text detection models can benefit from the synthetic historical maps.
arXiv Detail & Related papers (2021-12-12T00:27:03Z) - ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use the CLIP encoding as a prefix to the caption via a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
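A compact sketch of that prefix scheme: a small MLP maps the CLIP image embedding to a fixed number of pseudo-token embeddings, which are prepended to the caption token embeddings before the language model; the layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClipPrefixMapper(nn.Module):
    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, (lm_dim * prefix_len) // 2),
            nn.Tanh(),
            nn.Linear((lm_dim * prefix_len) // 2, lm_dim * prefix_len),
        )

    def forward(self, clip_embed):               # (B, clip_dim)
        prefix = self.mlp(clip_embed)            # (B, prefix_len * lm_dim)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

# Training concatenates this prefix with the caption token embeddings and
# fine-tunes the language model with the usual next-token loss.
```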
arXiv Detail & Related papers (2021-11-18T14:49:15Z) - Weakly-Supervised Salient Object Detection via Scribble Annotations [54.40518383782725]
We propose a weakly-supervised salient object detection model to learn saliency from scribble labels.
We present a new metric, termed the saliency structure measure, to evaluate the structural alignment of predicted saliency maps.
Our method not only outperforms existing weakly-supervised/unsupervised methods, but is also on par with several fully-supervised state-of-the-art models.
arXiv Detail & Related papers (2020-03-17T12:59:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.