Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing
- URL: http://arxiv.org/abs/2505.11121v1
- Date: Fri, 16 May 2025 11:08:05 GMT
- Title: Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing
- Authors: Mathis Jürgen Adler, Leonard Hackel, Gencer Sumbul, Begüm Demir,
- Abstract summary: VLM pretraining aims to learn image and language alignments from a large number of image-text pairs.<n>We introduce a weighted feature aggregation (WFA) strategy for VLM pretraining in remote sensing.<n>We propose two techniques: (i) non-language uniqueness and (ii) learning-based attention.
- Score: 2.4949076090402724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of foundation models through pretraining of vision-language models (VLMs) has recently attracted great attention in remote sensing (RS). VLM pretraining aims to learn image and language alignments from a large number of image-text pairs. Each pretraining image is often associated with multiple captions containing redundant information due to repeated or semantically similar phrases, resulting in increased pretraining and inference time. To overcome this, we introduce a weighted feature aggregation (WFA) strategy for VLM pretraining in RS. Our strategy aims to extract and exploit complementary information from multiple captions per image while reducing redundancies through feature aggregation with importance weighting. To calculate adaptive importance weights for different captions of each image, we propose two techniques: (i) non-parametric uniqueness and (ii) learning-based attention. In the first technique, importance weights are calculated based on the bilingual evaluation understudy (BLEU) scores of the captions to emphasize unique sentences and reduce the influence of repetitive ones. In the second technique, importance weights are learned through an attention mechanism instead of relying on hand-crafted features. The effectiveness of the proposed WFA strategy with the two techniques is analyzed in terms of downstream performance on text-to-image retrieval in RS. Experimental results show that the proposed strategy enables efficient and effective pretraining of VLMs in RS. Based on the experimental analysis, we derive guidelines for selecting appropriate techniques depending on downstream task requirements and resource constraints. The code of this work is publicly available at https://git.tu-berlin.de/rsim/redundacy-aware-rs-vlm.
Related papers
- Adding simple structure at inference improves Vision-Language Compositionality [15.785274903236663]
In this paper, we propose to add simple structure at inference, where, given an image and a caption, we divide the image into different smaller crops.<n>We find that our approach consistently improves the performance of evaluated Vision-Language Models without any training.
arXiv Detail & Related papers (2025-06-11T13:06:25Z) - ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based framework has become the mainstream for processing the whole slide image (WSI)<n>We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z) - Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning, that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - CAVL: Learning Contrastive and Adaptive Representations of Vision and
Language [10.57079240576682]
Visual and linguistic pre-training aims to learn vision and language representations together.
Current pre-trained models tend to take lots of computation resources for fine-tuning when transferred to downstream tasks.
We present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL.
arXiv Detail & Related papers (2023-04-10T05:54:03Z) - Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the relevant regions to each word by computing the word-conditional visual attention using multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z) - CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language
Representation Alignment [146.3128011522151]
We propose a Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z) - Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-basedER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Align before Fuse: Vision and Language Representation Learning with
Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to representations BEfore Fusing (ALBEF) through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z) - Contrastive learning of global and local features for medical image
segmentation with limited annotations [10.238403787504756]
A key requirement for the success of supervised deep learning is a large labeled dataset.
We propose strategies for extending the contrastive learning framework for segmentation of medical images in the semi-supervised setting.
In the limited annotation setting, the proposed method yields substantial improvements compared to other self-supervision and semi-supervised learning techniques.
arXiv Detail & Related papers (2020-06-18T13:31:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.