Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
- URL: http://arxiv.org/abs/2312.06960v1
- Date: Tue, 12 Dec 2023 03:39:07 GMT
- Title: Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
- Authors: Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl Vondrick, Bharath Hariharan, Kavita Bala
- Abstract summary: We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language.
- Score: 61.769441954135246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a method to train vision-language models for remote-sensing
images without using any textual annotations. Our key insight is to use
co-located internet imagery taken on the ground as an intermediary for
connecting remote-sensing images and language. Specifically, we train an image
encoder for remote sensing images to align with the image encoder of CLIP using
a large amount of paired internet and satellite images. Our unsupervised
approach enables the training of a first-of-its-kind large-scale vision
language model (VLM) for remote sensing images at two different resolutions. We
show that these VLMs enable zero-shot, open-vocabulary image classification,
retrieval, segmentation and visual question answering for satellite images. On
each of these tasks, our VLM trained without textual annotations outperforms
existing VLMs trained with supervision, with gains of up to 20% for
classification and 80% for segmentation.
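The training recipe described above (a trainable satellite-image encoder pulled toward CLIP's frozen image encoder over co-located ground/satellite pairs) can be sketched as a symmetric contrastive loss. The following is a minimal PyTorch-style sketch, not the authors' implementation; the encoder interfaces, the paired data loader, and the temperature value are assumptions.

```python
# Minimal sketch (not the authors' code): align a trainable satellite-image encoder
# to a frozen CLIP image encoder using co-located ground/satellite image pairs.
# `sat_encoder`, `clip_image_encoder`, and the batch of paired images are stand-ins.
import torch
import torch.nn.functional as F

def clip_alignment_loss(sat_emb, ground_emb, temperature=0.07):
    """Symmetric InfoNCE: co-located (satellite, ground) images are positives."""
    sat_emb = F.normalize(sat_emb, dim=-1)
    ground_emb = F.normalize(ground_emb, dim=-1)
    logits = sat_emb @ ground_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(sat_emb.size(0), device=sat_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def train_step(sat_encoder, clip_image_encoder, sat_images, ground_images, optimizer):
    with torch.no_grad():                                    # CLIP image encoder stays frozen
        ground_emb = clip_image_encoder(ground_images)       # targets live in CLIP space
    sat_emb = sat_encoder(sat_images)                        # trainable remote-sensing encoder
    loss = clip_alignment_loss(sat_emb, ground_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the satellite encoder is pulled into CLIP's embedding space, CLIP's text encoder can then be reused as-is for the zero-shot classification, retrieval, and segmentation described above.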
Related papers
- Multilingual Vision-Language Pre-training for the Remote Sensing Domain [4.118895088882213]
Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data.
This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model.
Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks.
arXiv Detail & Related papers (2024-10-30T18:13:11Z)
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still make basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- How Well Can Vision Language Models See Image Details? [53.036922527685064]
We introduce a pixel value prediction task to explore "How Well Can Vision Language Models See Image Details?"
Our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks.
arXiv Detail & Related papers (2024-08-07T17:59:40Z)
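Read literally, the pixel-value-prediction pre-training task above amounts to attaching a small regression head to the vision encoder's patch tokens and supervising it with the image's own pixels. The snippet below is a speculative sketch under that reading (patch-mean RGB targets, a simple linear head); it is not the paper's actual formulation.

```python
# Speculative sketch: an auxiliary pixel-value-prediction loss on top of ViT-style
# patch features. Patch-mean RGB targets and the encoder interface are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelPredictionHead(nn.Module):
    def __init__(self, feature_dim, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(feature_dim, 3)                 # predict mean RGB per patch

    def forward(self, patch_features, images):
        # patch_features: (B, N, D) patch tokens (no CLS token); images: (B, 3, H, W) in [0, 1]
        pred = self.proj(patch_features)                      # (B, N, 3)
        target = F.avg_pool2d(images, self.patch_size)        # (B, 3, H/p, W/p) mean colour per patch
        target = target.flatten(2).transpose(1, 2)            # (B, N, 3)
        return F.mse_loss(pred, target)
```

Such an auxiliary loss would simply be summed with the main image-language objective during pre-training.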
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Large Language Models for Captioning and Retrieving Remote Sensing Images [4.499596985198142]
RS-CapRet is a Vision and Language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
arXiv Detail & Related papers (2024-02-09T15:31:01Z)
- MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images [7.0622873873577054]
We propose a novel metadata-collaborative segmentation network (MetaSegNet) for semantic segmentation of remote sensing images.
Unlike the common model structure that only uses unimodal visual data, we extract key characteristics from freely available remote sensing image metadata.
We construct an image encoder, a text encoder, and a cross-modal attention fusion subnetwork to extract the image and text features.
arXiv Detail & Related papers (2023-12-20T03:16:34Z)
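The cross-modal attention fusion subnetwork mentioned for MetaSegNet is not specified in the summary; as a generic illustration only, fusion of this kind is often implemented by letting image tokens attend to text (metadata) tokens, as in the sketch below. The module name and shapes are assumptions, not MetaSegNet's actual design.

```python
# Generic sketch of cross-modal attention fusion (image queries attending to text
# keys/values); the real MetaSegNet subnetwork may differ substantially.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, D), text_tokens: (B, N_txt, D)
        fused, _ = self.attn(query=image_tokens, key=text_tokens, value=text_tokens)
        return self.norm(image_tokens + fused)                # residual + layer norm
```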
- C-SAW: Self-Supervised Prompt Learning for Image Generalization in Remote Sensing [12.930814370829893]
We focus on domain and class generalization problems in analyzing optical remote sensing images, using the large-scale pre-trained vision-language model (VLM), CLIP.
Existing prompt learning techniques overlook the importance of incorporating domain and content information into the prompts.
We propose a solution that ensures domain-invariant prompt learning while enhancing the expressiveness of visual features.
arXiv Detail & Related papers (2023-11-27T13:35:20Z)
- Towards Automatic Satellite Images Captions Generation Using Large Language Models [0.5439020425819]
We propose Automatic Remote Sensing Image Captioning (ARSIC) to automatically collect captions for remote sensing images.
We also present a benchmark model that adapts the pre-trained generative image2text model (GIT) to generate high-quality captions for remote-sensing images.
arXiv Detail & Related papers (2023-10-17T16:45:47Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
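One plausible reading of the UniVLC loss above is a contrastive objective in which label data contributes extra positives: any text sharing the anchor image's class counts as a match, so image-text and image-label (or video-label) batches can be mixed. The sketch below illustrates that reading only and is not OmniVL's actual loss; the encoders and label convention are assumptions.

```python
# Illustrative sketch only (not the actual UniVLC loss): a soft-target contrastive
# objective where samples sharing a label are all treated as positives. Purely
# caption-paired samples get unique labels, so only the diagonal matches for them.
import torch
import torch.nn.functional as F

def multi_positive_contrastive(image_emb, text_emb, labels, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                  # (B, B) similarities
    match = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()     # (B, B) positive mask
    targets = match / match.sum(dim=1, keepdim=True)                 # row-normalised soft targets
    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```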
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.