Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model
via Cross-modal Alignment
- URL: http://arxiv.org/abs/2402.09816v1
- Date: Thu, 15 Feb 2024 09:31:07 GMT
- Title: Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model
via Cross-modal Alignment
- Authors: Angelos Zavras, Dimitrios Michail, Begüm Demir, Ioannis Papoutsis
- Abstract summary: We focus on Contrastive Language-Image Pre-training (CLIP), an open-vocabulary foundation model, which achieves high accuracy across many image classification tasks.
There are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery.
We propose a methodology for aligning distinct RS imagery modalities with the visual and textual modalities of CLIP.
- Score: 2.389598109913754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Learning (DL) is undergoing a paradigm shift with the emergence of foundation models, aptly named for their crucial, yet incomplete, nature. In this
work, we focus on Contrastive Language-Image Pre-training (CLIP), an
open-vocabulary foundation model, which achieves high accuracy across many
image classification tasks and is often competitive with a fully supervised
baseline without being explicitly trained. Nevertheless, there are still
domains where zero-shot CLIP performance is far from optimal, such as Remote
Sensing (RS) and medical imagery. These domains not only exhibit fundamentally different distributions from natural images, but also commonly rely on complementary modalities, beyond RGB, to derive meaningful insights. To this end, we propose a methodology for aligning distinct RS imagery modalities with the visual and textual modalities of CLIP. Our two-stage procedure comprises robust fine-tuning of CLIP to deal with the distribution shift, followed by the cross-modal alignment of an RS modality encoder, in an effort to extend the zero-shot capabilities of CLIP. We
ultimately demonstrate our method on the tasks of RS imagery classification and
cross-modal retrieval. We empirically show that both robust fine-tuning and
cross-modal alignment translate to significant performance gains across several RS benchmark datasets. Notably, these enhancements are achieved without relying on textual descriptions, without introducing any task-specific parameters, without training from scratch, and without catastrophic forgetting.
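
The second stage hinges on mapping a new RS modality into the embedding space that CLIP's image and text encoders already share, so that CLIP's text encoder can then be reused zero-shot for that modality. As a rough, hypothetical illustration of what such a cross-modal alignment objective could look like, the PyTorch sketch below pulls a trainable RS encoder towards frozen CLIP image embeddings of co-registered RGB views with a symmetric contrastive loss; the encoder architecture, the loss choice, and all names (RSEncoder, alignment_loss, rs_batch, rgb_batch) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: align a trainable encoder for a non-RGB RS modality
# (e.g. multispectral or SAR patches) with a *frozen* CLIP image encoder,
# using a symmetric InfoNCE loss over co-registered image pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSEncoder(nn.Module):
    """Toy encoder for an RS modality with `in_channels` input bands."""
    def __init__(self, in_channels: int, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),  # project into CLIP's embedding dimension
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)

def alignment_loss(rs_emb: torch.Tensor, clip_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss pulling co-registered RS / CLIP-RGB pairs together."""
    rs_emb = F.normalize(rs_emb, dim=-1)
    clip_emb = F.normalize(clip_emb, dim=-1)
    logits = rs_emb @ clip_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# One training step (only the RS encoder receives gradients):
#   rs_emb = rs_encoder(rs_batch)                      # e.g. Sentinel-2 bands
#   with torch.no_grad():
#       clip_emb = clip_model.encode_image(rgb_batch)  # co-registered RGB view
#   loss = alignment_loss(rs_emb, clip_emb)
```

Because the CLIP branch stays frozen in this sketch, the aligned RS embeddings land in the same space as CLIP's text embeddings, which is what would let the text encoder be reused zero-shot for the new modality.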
Related papers
- MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images.
In this paper, we propose a two-stage framework to tackle both discrepancies.
MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z)
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
- Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling [11.129453244307369]
FG-SBIR aims to minimize the distance between sketches and corresponding images in the embedding space.
We propose an effective approach to narrow the gap between the two domains.
It mainly facilitates unified mutual information sharing both within and across samples.
arXiv Detail & Related papers (2024-06-17T13:49:12Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification with Cross-Modal Retrieval [29.838375158101027]
Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability.
We propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble.
X-MoRe demonstrates robust performance across a diverse set of tasks without the need for additional training.
arXiv Detail & Related papers (2023-08-29T13:02:35Z)
- Continual Vision-Language Representation Learning with Off-Diagonal Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z)
- SCMM: Calibrating Cross-modal Representations for Text-Based Person Search [43.17325362167387]
Text-Based Person Search (TBPS) is a crucial task that enables accurate retrieval of target individuals from large-scale galleries.
For cross-modal TBPS tasks, it is critical to obtain well-distributed representations in the common embedding space.
We present a method named Sew Calibration and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings.
arXiv Detail & Related papers (2023-04-05T07:50:16Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
TIReID aims to retrieve the image corresponding to the given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The vision-language learning objective of CLIP does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO is used to align the similarity scores considering the discrepancy between the images and their neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
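
For context, the zero-shot classification protocol that the main paper extends, and that several of the works listed above (e.g. X-MoRe, CODER) also start from, scores an image against text embeddings of prompted class names. Below is a minimal, hypothetical sketch of that baseline using the open_clip package; the model tag, class names, prompt template, and file name are illustrative assumptions, not the setup used in any of these papers.

```python
# Hypothetical sketch of standard zero-shot CLIP classification: embed prompted
# class names with the text encoder and pick the class whose text embedding is
# closest to the image embedding. Assumes the open_clip package is installed.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["forest", "river", "industrial area", "residential area"]  # illustrative RS classes
prompts = tokenizer([f"a satellite image of a {c}" for c in class_names])
image = preprocess(Image.open("scene.png")).unsqueeze(0)  # hypothetical test image

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(prompts)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

print(dict(zip(class_names, probs.squeeze(0).tolist())))
```

In the setting of the main paper, a cross-modally aligned RS encoder, as sketched after the abstract above, would stand in for model.encode_image when the input is a non-RGB modality, while the prompted text branch stays unchanged.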