Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
- URL: http://arxiv.org/abs/2402.09816v1
- Date: Thu, 15 Feb 2024 09:31:07 GMT
- Title: Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
- Authors: Angelos Zavras, Dimitrios Michail, Begüm Demir, Ioannis Papoutsis
- Abstract summary: We focus on Contrastive Language-Image Pre-training (CLIP), an open-vocabulary foundation model, which achieves high accuracy across many image classification tasks.
There are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery.
We propose a methodology for aligning distinct RS imagery modalities with the visual and textual modalities of CLIP.
- Score: 2.389598109913754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Learning (DL) is undergoing a paradigm shift with the emergence of
foundation models, aptly named for their crucial yet incomplete nature. In this
work, we focus on Contrastive Language-Image Pre-training (CLIP), an
open-vocabulary foundation model, which achieves high accuracy across many
image classification tasks and is often competitive with a fully supervised
baseline without being explicitly trained. Nevertheless, there are still
domains where zero-shot CLIP performance is far from optimal, such as Remote
Sensing (RS) and medical imagery. These domains not only exhibit
fundamentally different distributions from natural images, but also
commonly rely on complementary modalities, beyond RGB, to derive meaningful
insights. To this end, we propose a methodology for aligning
distinct RS imagery modalities with the visual and textual modalities of CLIP.
Our two-stage procedure comprises robust fine-tuning of CLIP to deal
with the distribution shift, followed by the cross-modal alignment of an RS
modality encoder, in an effort to extend the zero-shot capabilities of CLIP. We
ultimately demonstrate our method on the tasks of RS imagery classification and
cross-modal retrieval. We empirically show that both robust fine-tuning and
cross-modal alignment translate to significant performance gains, across
several RS benchmark datasets. Notably, these enhancements are achieved without
reliance on textual descriptions, without introducing any task-specific
parameters, without training from scratch, and without catastrophic forgetting.
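The abstract describes the two-stage procedure only at a high level. The sketch below is a minimal, hedged reading of the second stage in PyTorch: a new RS-modality encoder (e.g. for SAR) is trained to match the embeddings that a frozen, robustly fine-tuned CLIP image encoder produces for co-registered RGB patches, so the extra modality inherits CLIP's zero-shot behaviour without any text. Every name here (RSModalityEncoder, alignment_loss, sar_batch, rgb_batch) and the cosine-distance objective are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of stage 2 (cross-modal alignment); stage 1, the robust
# fine-tuning of CLIP on RS RGB imagery, is assumed to have happened already.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSModalityEncoder(nn.Module):
    """Toy encoder for a non-RGB RS modality (e.g. 2-channel SAR) projecting
    into the CLIP embedding space. Purely illustrative architecture."""
    def __init__(self, in_channels: int = 2, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x):
        # L2-normalise so embeddings live on the same unit sphere as CLIP's.
        return F.normalize(self.proj(self.backbone(x)), dim=-1)

def alignment_loss(rs_emb, clip_img_emb):
    """One plausible alignment objective: pull each RS-modality embedding onto
    the frozen CLIP embedding of its co-registered RGB patch (cosine distance).
    The loss actually used by the paper is not specified in the abstract."""
    return (1.0 - F.cosine_similarity(rs_emb, clip_img_emb, dim=-1)).mean()

def train_step(rs_encoder, clip_image_encoder, optimizer, sar_batch, rgb_batch):
    """Single stage-2 step: only the new encoder receives gradients, so the
    fine-tuned CLIP weights (and their zero-shot behaviour) stay untouched."""
    with torch.no_grad():
        clip_emb = F.normalize(clip_image_encoder(rgb_batch), dim=-1)
    rs_emb = rs_encoder(sar_batch)
    loss = alignment_loss(rs_emb, clip_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because CLIP is frozen in this stage and no task-specific parameters are introduced beyond the new encoder itself, such a setup is consistent with the abstract's claims of no textual supervision, no training from scratch, and no catastrophic forgetting.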
Related papers
- Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion [13.696706205837238]
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications.
We argue that this intra-modal misalignment is inherently due to the CLIP-style inter-modal contrastive loss, which does not enforce any intra-modal constraints (a minimal sketch of this loss appears after the related-papers list below).
We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance.
arXiv Detail & Related papers (2025-02-06T17:58:59Z)
- Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention [59.19580789952102]
This paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks.
MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization.
MUCA utilizes a Cross-Teacher-Student attention mechanism to guide the student network toward constructing more discriminative feature representations.
arXiv Detail & Related papers (2025-01-18T11:57:20Z)
- Relation-Aware Meta-Learning for Zero-shot Sketch-Based Image Retrieval [89.15541654536544]
Sketch-based image retrieval (SBIR) relies on free-hand sketches to retrieve natural photos within the same class.
To handle categories unseen during training, the task has evolved into Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR).
We propose a novel framework for ZS-SBIR that employs a pair-based relation-aware quadruplet loss to bridge feature gaps.
arXiv Detail & Related papers (2024-11-28T09:35:27Z)
- MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images.
In this paper, we propose a two-stage framework to tackle both the modality and the task discrepancies.
MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z)
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including adaptation from images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
- Unity in Diversity: Multi-expert Knowledge Confrontation and Collaboration for Generalizable Vehicle Re-identification [60.20318058777603]
Generalizable vehicle re-identification (ReID) seeks to develop models that can adapt to unknown target domains without the need for fine-tuning or retraining.
Previous works have mainly focused on extracting domain-invariant features by aligning data distributions between source domains.
We propose a two-stage Multi-expert Knowledge Confrontation and Collaboration (MiKeCoCo) method to address this problem.
arXiv Detail & Related papers (2024-07-10T04:06:39Z)
- ARNet: Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling [11.129453244307369]
Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims to minimize the distance between sketches and their corresponding images in the embedding space.
We propose an effective approach to narrow the gap between the two domains.
It mainly facilitates unified mutual information sharing both within and across samples.
arXiv Detail & Related papers (2024-06-17T13:49:12Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification with Cross-Modal Retrieval [29.838375158101027]
Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability.
We propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble.
X-MoRe demonstrates robust performance across a diverse set of tasks without the need for additional training.
arXiv Detail & Related papers (2023-08-29T13:02:35Z)
- SCMM: Calibrating Cross-modal Representations for Text-Based Person Search [43.17325362167387]
Text-Based Person Search (TBPS) is a crucial task in the Internet of Things (IoT) domain.
For cross-modal TBPS tasks, it is critical to obtain well-distributed representation in the common space.
We present Sew embedding and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings.
arXiv Detail & Related papers (2023-04-05T07:50:16Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO is used to align the similarity scores considering the discrepancy between the images and their neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with a GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
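Several entries above, most explicitly the "Cross the Gap" paper, refer to the CLIP-style inter-modal contrastive loss. The sketch below (referenced in that entry) shows the standard symmetric InfoNCE objective over a batch of paired image/text embeddings; it is a generic illustration, not code taken from any of the listed papers.

```python
# Standard CLIP-style symmetric contrastive (InfoNCE) loss: each image is
# pushed towards its paired text and away from the other texts in the batch,
# and vice versa. Illustrative only.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)      # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```

Note that this objective only ties image i to text i across modalities; nothing constrains image-to-image or text-to-text geometry, which is the intra-modal gap the "Cross the Gap" entry highlights.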