A Recipe for Improving Remote Sensing VLM Zero Shot Generalization
- URL: http://arxiv.org/abs/2503.08722v2
- Date: Mon, 17 Mar 2025 13:49:27 GMT
- Title: A Recipe for Improving Remote Sensing VLM Zero Shot Generalization
- Authors: Aviad Barzilai, Yotam Gigi, Amr Helmy, Vered Silverman, Yehonathan Refael, Bolous Jaber, Tomer Shekel, George Leifman, Genady Beryozkin,
- Abstract summary: We present two novel image-caption datasets for training of remote sensing foundation models. The first dataset pairs aerial and satellite imagery with captions generated by Gemini using landmarks extracted from Google Maps. The second dataset utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain.
- Score: 0.4427533728730559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models have had a significant impact across various AI applications, enabling use cases that were previously impossible. Contrastive Visual Language Models (VLMs), in particular, have outperformed other techniques in many tasks. However, their prevalence in remote sensing (RS) is still limited, due to the scarcity of diverse remote-sensing visual-language datasets. In this work we introduce two novel image-caption datasets for training of remote sensing foundation models. The first dataset pairs aerial and satellite imagery with captions generated by Gemini using landmarks extracted from Google Maps. The second dataset utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain, resulting in a diverse dataset with greater breadth in image styles and subject matter. These datasets are used to pre-train the MaMMUT (Kuo et al., 2023) VLM architecture, resulting in state-of-the-art generalization performance in zero-shot cross-modal retrieval on well-known public benchmarks. Finally, we present our ongoing research to distill image-level knowledge gained in the VLM contrastive training procedure to enhance the model's localization ability. Specifically, we iteratively generate pseudo-labels for image regions based on the model's attention maps and use these labels for further training. To mitigate noisy attention maps and create robust segmentation masks, we introduce a novel attention-pooling mechanism called the Smooth-Attention-Operation.
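The abstract describes the localization procedure only at a high level: attention maps from the contrastively trained VLM are smoothed and pooled into region pseudo-labels, which are then used for further training. As a rough illustration of that idea, here is a minimal PyTorch-style sketch; the kernel size, temperature, threshold, and the `image_text_attention` accessor are all assumptions for illustration, not details taken from the paper, and the real Smooth-Attention-Operation is not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def smooth_attention_pooling(attn, kernel_size=7, temperature=0.07):
    """Smooth per-location attention scores and renormalize them into a soft
    spatial mask. A stand-in for the paper's Smooth-Attention-Operation, whose
    exact form is not given in the abstract."""
    attn = attn.unsqueeze(1)  # (B, H, W) -> (B, 1, H, W)
    # Local average pooling suppresses isolated high-attention pixels (noise).
    smoothed = F.avg_pool2d(attn, kernel_size, stride=1, padding=kernel_size // 2)
    b, _, h, w = smoothed.shape
    # Softmax over spatial positions turns the smoothed scores into a distribution.
    soft = torch.softmax(smoothed.reshape(b, -1) / temperature, dim=-1)
    return soft.reshape(b, h, w)

def generate_pseudo_masks(model, images, text_tokens, threshold=0.5):
    """One iteration of pseudo-label generation: read image-text attention maps
    from the VLM, smooth-pool them, and binarize them into segmentation
    pseudo-labels that can supervise the next round of training."""
    with torch.no_grad():
        # Hypothetical accessor; any per-patch image-text attention map works here.
        attn = model.image_text_attention(images, text_tokens)  # (B, H, W)
    soft = smooth_attention_pooling(attn)
    # Min-max normalize each map before thresholding so one cutoff fits all images.
    flat = soft.flatten(1)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    norm = (flat - lo) / (hi - lo + 1e-8)
    return (norm.reshape_as(soft) > threshold).float()  # binary pseudo-masks
```

In the iterative loop the abstract describes, these masks would be fed back as additional supervision and then regenerated from the updated model, so the quality of the pooling step directly controls how much attention noise leaks into the pseudo-labels.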
Related papers
- AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images [21.294581646546124]
AerOSeg is a novel Open-Vocabulary Segmentation (OVS) approach for remote sensing data.
We compute robust image-text correlation features using rotated versions of the input image and domain-specific prompts.
Inspired by the success of the Segment Anything Model (SAM) in diverse domains, we leverage SAM features to guide the spatial refinement of correlation features.
We enhance the refined correlation features using a multi-scale attention-aware composition to produce the final segmentation map.
arXiv Detail & Related papers (2025-04-12T13:06:46Z) - SpecDM: Hyperspectral Dataset Synthesis with Pixel-level Semantic Annotations [27.391859339238906]
In this paper, we explore the potential of generative diffusion models in synthesizing hyperspectral images with pixel-level annotations.
To the best of our knowledge, it is the first work to generate high-dimensional HSIs with annotations.
We select two of the most widely used dense prediction tasks: semantic segmentation and change detection, and generate datasets suitable for these tasks.
arXiv Detail & Related papers (2025-02-24T11:13:37Z) - Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression.
We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges.
Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
arXiv Detail & Related papers (2025-01-01T14:24:04Z) - A Large-scale Interpretable Multi-modality Benchmark for Facial Image Forgery Localization [22.725542948364357]
We argue that the basic binary forgery mask is inadequate for explaining model predictions.
In this study, we generate salient region-focused interpretation for the forgery images.
We develop ForgeryTalker, an architecture designed for concurrent forgery localization and interpretation.
arXiv Detail & Related papers (2024-12-27T15:23:39Z) - Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is to augment the training set with synthetic images generated by text-to-image (T2I) models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z) - SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation [69.42764583465508]
We explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks.
To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation.
arXiv Detail & Related papers (2024-03-25T10:30:22Z) - Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data.
In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z) - SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing [14.79627534702196]
We construct a vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags.
With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification.
It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval.
arXiv Detail & Related papers (2023-12-20T09:19:48Z) - Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z) - Semantic Segmentation with Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization [112.68171734288237]
We propose a novel framework for discriminative pixel-level tasks using a generative model of both images and labels.
We learn a generative adversarial network that captures the joint image-label distribution and is trained efficiently using a large set of unlabeled images.
We demonstrate strong in-domain performance compared to several baselines, and are the first to showcase extreme out-of-domain generalization.
arXiv Detail & Related papers (2021-04-12T21:41:25Z) - X-ModalNet: A Semi-Supervised Deep Cross-Modal Network for Classification of Remote Sensing Data [69.37597254841052]
We propose a novel cross-modal deep-learning framework called X-ModalNet.
X-ModalNet generalizes well by propagating labels over an updatable graph constructed from high-level features at the top of the network.
We evaluate X-ModalNet on two multi-modal remote sensing datasets (HSI-MSI and HSI-SAR) and achieve a significant improvement in comparison with several state-of-the-art methods.
arXiv Detail & Related papers (2020-06-24T15:29:41Z)