Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification
- URL: http://arxiv.org/abs/2409.00698v1
- Date: Sun, 1 Sep 2024 11:39:13 GMT
- Title: Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification
- Authors: Karim El Khoury, Maxime Zanella, Benoît Gérin, Tiffanie Godelaine, Benoît Macq, Saïd Mahmoudi, Christophe De Vleeschouwer, Ismail Ben Ayed
- Abstract summary: Vision-Language Models for remote sensing have shown promising uses thanks to their extensive pretraining.
Our approach tackles the limitation of independent, patch-wise (inductive) predictions by utilizing initial predictions based on text prompting together with patch affinity relationships.
Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements.
- Score: 19.850063789903846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models for remote sensing have shown promising uses thanks to their extensive pretraining. However, their conventional usage in zero-shot scene classification methods still involves dividing large images into patches and making independent predictions, i.e., inductive inference, thereby limiting their effectiveness by ignoring valuable contextual information. Our approach tackles this issue by utilizing initial predictions based on text prompting and patch affinity relationships from the image encoder to enhance zero-shot capabilities through transductive inference, all without the need for supervision and at a minor computational cost. Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements over inductive zero-shot classification. Our source code is publicly available on Github: https://github.com/elkhouryk/RS-TransCLIP
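For intuition, here is a minimal sketch of the kind of transductive refinement described above: initial text-prompt predictions are smoothed over a patch-affinity graph built from the image-encoder features. The neighbourhood size, temperature, and propagation rule below are illustrative assumptions, not the exact RS-TransCLIP formulation (see the repository above for that).

```python
# Minimal sketch of transductive zero-shot patch classification.
# Assumes `image_feats` (N x D) and `text_feats` (C x D) are L2-normalized
# CLIP-style embeddings; the smoothing scheme is a generic label-propagation
# stand-in, not the exact RS-TransCLIP objective.
import torch

def transductive_refine(image_feats, text_feats, temperature=0.01,
                        k=10, alpha=0.7, n_iters=20):
    # Inductive step: standard zero-shot CLIP probabilities per patch.
    probs = (image_feats @ text_feats.T / temperature).softmax(dim=-1)   # (N, C)

    # Patch affinity: keep each patch's k nearest neighbours, row-normalized.
    sim = image_feats @ image_feats.T                                     # (N, N)
    topk = sim.topk(k + 1, dim=-1)
    affinity = torch.zeros_like(sim).scatter_(-1, topk.indices,
                                              topk.values.clamp(min=0))
    affinity.fill_diagonal_(0)
    affinity = affinity / affinity.sum(dim=-1, keepdim=True).clamp(min=1e-8)

    # Transductive step: blend each patch's prediction with its neighbours'.
    refined = probs.clone()
    for _ in range(n_iters):
        refined = alpha * (affinity @ refined) + (1 - alpha) * probs
    return refined.argmax(dim=-1)
```

In this sketch the contextual information enters only through the affinity matrix, which is why such refinement adds little computational cost on top of the inductive predictions.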
Related papers
- Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models [64.67721492968941]
We propose a Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) framework.
Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness.
Our method yields a 9.58% enhancement in zero-shot robust accuracy over the current state-of-the-art techniques.
arXiv Detail & Related papers (2024-10-29T07:15:09Z)
- Boosting Vision-Language Models for Histopathology Classification: Predict all at once [11.644118356081531]
We introduce a transductive approach to vision-language models for histo-pathology.
Our approach is highly efficient, processing $10^5$ patches in just a few seconds.
arXiv Detail & Related papers (2024-09-03T13:24:12Z)
- The Neglected Tails in Vision-Language Models [51.79913798808725]
We show that vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts.
We propose REtrieval-Augmented Learning (REAL) to mitigate the imbalanced performance of zero-shot VLMs.
arXiv Detail & Related papers (2024-01-23T01:25:00Z)
- POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models [62.23255433487586]
We propose an unsupervised fine-tuning framework to fine-tune the model or prompt on the unlabeled target data.
We demonstrate how to apply our method to both language-augmented vision and masked-language models by aligning the discrete distributions extracted from the prompts and target data.
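As a rough illustration of prompt-oriented unsupervised fine-tuning, the sketch below adapts frozen class text embeddings on unlabeled target features. The objective used here (per-sample entropy plus a uniform-marginal regularizer) is a generic stand-in for the paper's distribution alignment, not POUF's exact loss, and all names are hypothetical.

```python
# Hypothetical sketch of prompt adaptation on unlabeled target features.
# `text_feats` (C x D) and `target_feats` (N x D) are assumed L2-normalized.
import torch
import torch.nn.functional as F

def adapt_prompts(text_feats, target_feats, steps=100, lr=1e-2, temperature=0.01):
    offset = torch.zeros_like(text_feats, requires_grad=True)  # learnable prompt offset
    opt = torch.optim.Adam([offset], lr=lr)
    for _ in range(steps):
        prompts = F.normalize(text_feats + offset, dim=-1)
        probs = (target_feats @ prompts.T / temperature).softmax(dim=-1)   # (N, C)
        # Encourage confident per-sample predictions...
        ent = -(probs * probs.clamp(min=1e-8).log()).sum(dim=-1).mean()
        # ...while keeping the batch-level class marginal close to uniform.
        marginal = probs.mean(dim=0)
        log_uniform = -torch.log(torch.tensor(float(probs.shape[1])))
        kl = (marginal * (marginal.clamp(min=1e-8).log() - log_uniform)).sum()
        opt.zero_grad()
        (ent + kl).backward()
        opt.step()
    return F.normalize(text_feats + offset.detach(), dim=-1)
```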
arXiv Detail & Related papers (2023-04-29T22:05:22Z)
- Effective End-to-End Vision Language Pretraining with Semantic Visual Loss [58.642954383282216]
Current vision language pretraining models are dominated by methods using region visual features extracted from object detectors.
We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy.
Compared with region feature models, our end-to-end models could achieve similar or better performance on downstream tasks and run more than 10 times faster during inference.
arXiv Detail & Related papers (2023-01-18T00:22:49Z)
- Zero-Shot Text Classification with Self-Training [8.68603153534916]
We show that fine-tuning the zero-shot classifier on its most confident predictions leads to significant performance gains across a wide range of text classification tasks.
Self-training adapts the zero-shot model to the task at hand.
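A minimal sketch of that self-training loop, assuming `classifier` maps a list of texts to class logits and `optimizer` updates its parameters; the confidence threshold is an illustrative choice, not the paper's setting.

```python
# Hypothetical zero-shot self-training: pseudo-label with the current model,
# keep only the most confident predictions, and fine-tune on them.
import torch
import torch.nn.functional as F

def self_train(classifier, optimizer, unlabeled_texts, threshold=0.9, epochs=1):
    # 1) Pseudo-label the unlabeled data with the zero-shot classifier.
    with torch.no_grad():
        conf, pseudo_labels = classifier(unlabeled_texts).softmax(dim=-1).max(dim=-1)
    keep = conf >= threshold                     # most confident predictions only
    selected = [t for t, k in zip(unlabeled_texts, keep.tolist()) if k]
    targets = pseudo_labels[keep]

    # 2) Fine-tune on the selected pseudo-labelled subset.
    for _ in range(epochs):
        loss = F.cross_entropy(classifier(selected), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return classifier
```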
arXiv Detail & Related papers (2022-10-31T17:55:00Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of using images and discrete labels for learning a fixed set of weights, seen as visual concepts, to aligning images and raw text for two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
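A minimal sketch of that zero-shot transfer recipe with abstract dual encoders: class names become text prompts, both modalities are embedded and normalized, and each image takes the label of its nearest prompt. The encoder and tokenizer callables and the prompt template are assumptions, not a specific library API.

```python
# Hypothetical zero-shot classification with a CLIP-style dual encoder.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_encoder, text_encoder, tokenizer, images, class_names):
    # One natural-language prompt per class, encoded by the text encoder.
    prompts = tokenizer([f"a photo of a {name}" for name in class_names])
    with torch.no_grad():
        text_feats = F.normalize(text_encoder(prompts), dim=-1)    # (C, D)
        image_feats = F.normalize(image_encoder(images), dim=-1)   # (N, D)
    # Assign each image to the class whose prompt embedding is most similar.
    return (image_feats @ text_feats.T).argmax(dim=-1)
```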
arXiv Detail & Related papers (2021-09-02T17:57:31Z)
- VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining [39.24803665848558]
We propose VisualGPT, a data-efficient image captioning model that leverages the linguistic knowledge from a large pretrained language model (LM).
We designed a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the pretrained LM as the language decoder on a small amount of in-domain training data.
VisualGPT outperforms the best baseline model by up to 10.8% CIDEr on MS COCO and up to 5.4% CIDEr on Conceptual Captions.
arXiv Detail & Related papers (2021-02-20T18:02:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.