Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
- URL: http://arxiv.org/abs/2402.09816v2
- Date: Fri, 18 Jul 2025 11:42:52 GMT
- Title: Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
- Authors: Angelos Zavras, Dimitrios Michail, Begüm Demir, Ioannis Papoutsis
- Abstract summary: We focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across various image classification tasks. There are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery. We propose a methodology to align distinct RS image modalities with the visual and textual modalities of CLIP.
- Score: 4.682326604942316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Learning (DL) is undergoing a paradigm shift with the emergence of foundation models. In this work, we focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across various image classification tasks and often rivals fully supervised baselines, despite not being explicitly trained for those tasks. Nevertheless, there are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery. These domains not only exhibit fundamentally different distributions from natural images, but also commonly rely on complementary modalities, beyond RGB, to derive meaningful insights. To this end, we propose a methodology to align distinct RS image modalities with the visual and textual modalities of CLIP. Our two-stage procedure addresses the aforementioned distribution shift, extends the zero-shot capabilities of CLIP and enriches CLIP's shared embedding space with domain-specific knowledge. Initially, we robustly fine-tune CLIP according to the PAINT (Ilharco et al., 2022) patching protocol, in order to deal with the distribution shift. Building upon this foundation, we facilitate the cross-modal alignment of an RS modality encoder by distilling knowledge from the CLIP visual and textual encoders. We empirically show that both patching and cross-modal alignment translate to significant performance gains, across several RS imagery classification and cross-modal retrieval benchmark datasets. Notably, these enhancements are achieved without relying on textual descriptions, without introducing any task-specific parameters, without training from scratch and without catastrophic forgetting. We make our code implementation and weights for all experiments publicly available at https://github.com/Orion-AI-Lab/MindTheModalityGap.
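To make the two-stage procedure above concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: stage one patches CLIP by linearly interpolating the zero-shot and fine-tuned weights, following the PAINT protocol, and stage two trains an additional RS modality encoder to match the frozen, patched CLIP image embeddings of paired RGB views. The `encode_image` call, the `rs_encoder` module, the interpolation coefficient, and the cosine-distance loss are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F


def patch_clip(clip_zeroshot, clip_finetuned, alpha=0.5):
    """Stage 1 (PAINT-style patching): interpolate zero-shot and fine-tuned weights.

    Assumes both models share the same architecture and float-valued parameters;
    alpha is a hypothetical mixing coefficient.
    """
    zs_state = clip_zeroshot.state_dict()
    ft_state = clip_finetuned.state_dict()
    patched = {k: (1 - alpha) * zs_state[k] + alpha * ft_state[k] for k in zs_state}
    clip_zeroshot.load_state_dict(patched)
    return clip_zeroshot


def alignment_loss(rs_encoder, patched_clip, rs_batch, rgb_batch):
    """Stage 2: distill frozen patched-CLIP image embeddings into an RS encoder."""
    with torch.no_grad():
        # Frozen teacher: patched CLIP embeddings of the paired RGB views.
        teacher = F.normalize(patched_clip.encode_image(rgb_batch), dim=-1)
    # Trainable student: embeddings of the complementary RS modality (e.g., SAR).
    student = F.normalize(rs_encoder(rs_batch), dim=-1)
    # Pull paired embeddings together via cosine distance; a batch-wise
    # contrastive objective would be an equally plausible choice here.
    return (1 - (student * teacher).sum(dim=-1)).mean()
```

Under this sketch, zero-shot classification of the new RS modality then proceeds exactly as in CLIP, by comparing the aligned RS embeddings against text embeddings of class-name prompts, consistent with the abstract's claim that no textual descriptions or task-specific parameters are required.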
Related papers
- Post-pre-training for Modality Alignment in Vision-Language Foundation Models [12.110530026601968]
This paper presents CLIP-Refine, a post-pre-training method for CLIP models applied at a phase between pre-training and fine-tuning. It aims to align the feature space with one epoch of training on small image-text datasets without degrading zero-shot performance.
arXiv Detail & Related papers (2025-04-17T07:46:19Z) - Cross-Modal Consistency Learning for Sign Language Recognition [92.44927164283641]
Existing pre-training methods focus solely on compact pose data.
We propose a Cross-modal Consistency Learning framework (CCL-SLR).
CCL-SLR learns from both RGB and pose modalities based on self-supervised pre-training.
arXiv Detail & Related papers (2025-03-16T12:34:07Z) - Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training.
One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space.
We propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely a Textual Supplement (TS) module and a Semantic-Set (S-Set).
arXiv Detail & Related papers (2025-03-07T07:49:31Z) - CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval [13.59418209417664]
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query without training samples.
We propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-thought (CoT) and Multi-scale Reasoning.
arXiv Detail & Related papers (2025-02-28T08:12:23Z) - Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion [13.696706205837238]
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications.
We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints.
We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance.
arXiv Detail & Related papers (2025-02-06T17:58:59Z) - DiffCLIP: Few-shot Language-driven Multimodal Classifier [19.145645804307566]
DiffCLIP is a novel framework that extends Contrastive Language-Image Pretraining. It conveys comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP.
arXiv Detail & Related papers (2024-12-10T02:21:39Z) - Grounding Descriptions in Images informs Zero-Shot Visual Recognition [47.66166611138081]
We propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-12-05T18:52:00Z) - Relation-Aware Meta-Learning for Zero-shot Sketch-Based Image Retrieval [89.15541654536544]
Sketch-based image retrieval (SBIR) relies on free-hand sketches to retrieve natural photos within the same class.
To address this limitation, the task has evolved into Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR)
We propose a novel framework for ZS-SBIR that employs a pair-based relation-aware quadruplet loss to bridge feature gaps.
arXiv Detail & Related papers (2024-11-28T09:35:27Z) - MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images.
In this paper, we propose a two-stage framework to tackle both discrepancies.
MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z) - Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z) - SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs)
We develop an in-context learning approach to associate the inherent knowledge from LLMs.
Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z) - Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
CLIP has severe visual shortcomings: it can hardly distinguish orientation, quantity, color, and structure.
We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z) - Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP)
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z) - Unity in Diversity: Multi-expert Knowledge Confrontation and Collaboration for Generalizable Vehicle Re-identification [60.20318058777603]
Generalizable vehicle re-identification (ReID) seeks to develop models that can adapt to unknown target domains without the need for fine-tuning or retraining.
Previous works have mainly focused on extracting domain-invariant features by aligning data distributions between source domains.
We propose a two-stage Multi-expert Knowledge Confrontation and Collaboration (MiKeCoCo) method to solve this unique problem.
arXiv Detail & Related papers (2024-07-10T04:06:39Z) - Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling [11.129453244307369]
FG-SBIR aims to minimize the distance between sketches and corresponding images in the embedding space.
We propose an effective approach to narrow the gap between the two domains.
It mainly facilitates unified mutual information sharing both within and across samples.
arXiv Detail & Related papers (2024-06-17T13:49:12Z) - Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification with Cross-Modal Retrieval [29.838375158101027]
Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability.
We propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble.
X-MoRe demonstrates robust performance across a diverse set of tasks without the need for additional training.
arXiv Detail & Related papers (2023-08-29T13:02:35Z) - Continual Vision-Language Representation Learning with Off-Diagonal Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z) - SCMM: Calibrating Cross-modal Representations for Text-Based Person Search [43.17325362167387]
Text-Based Person Search (TBPS) is a crucial task that enables accurate retrieval of target individuals from large-scale galleries.
For cross-modal TBPS tasks, it is critical to obtain well-distributed representation in the common embedding space.
We present a method named Sew and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings.
arXiv Detail & Related papers (2023-04-05T07:50:16Z) - CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
TIReID aims to retrieve the image corresponding to the given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The vision-language learning objective of CLIP does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z) - A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation, building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z) - Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO is used to align the similarity scores considering the discrepancy between an image and its neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)