Related papers: When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery

When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery

URL: http://arxiv.org/abs/2404.11797v1
Date: Wed, 17 Apr 2024 23:30:48 GMT
Title: When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery
Authors: Yiqun Xie, Zhihao Wang, Weiye Chen, Zhili Li, Xiaowei Jia, Yanhua Li, Ruichen Wang, Kangyang Chai, Ruohan Li, Sergii Skakun,
Abstract summary: Foundation models, i.e. very large deep learning models, have demonstrated impressive performances in various language and vision tasks. Are foundation models always a suitable choice for different remote sensing tasks, and when or when not? This work aims to enhance the understanding of the status and suitability of foundation models for pixel-level classification.
Score: 23.464350453312584
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Foundation models, i.e., very large deep learning models, have demonstrated impressive performances in various language and vision tasks that are otherwise difficult to reach using smaller-size models. The major success of GPT-type of language models is particularly exciting and raises expectations on the potential of foundation models in other domains including satellite remote sensing. In this context, great efforts have been made to build foundation models to test their capabilities in broader applications, and examples include Prithvi by NASA-IBM, Segment-Anything-Model, ViT, etc. This leads to an important question: Are foundation models always a suitable choice for different remote sensing tasks, and when or when not? This work aims to enhance the understanding of the status and suitability of foundation models for pixel-level classification using multispectral imagery at moderate resolution, through comparisons with traditional machine learning (ML) and regular-size deep learning models. Interestingly, the results reveal that in many scenarios traditional ML models still have similar or better performance compared to foundation models, especially for tasks where texture is less useful for classification. On the other hand, deep learning models did show more promising results for tasks where labels partially depend on texture (e.g., burn scar), while the difference in performance between foundation models and deep learning models is not obvious. The results conform with our analysis: The suitability of foundation models depend on the alignment between the self-supervised learning tasks and the real downstream tasks, and the typical masked autoencoder paradigm is not necessarily suitable for many remote sensing problems.

Related papers

Evaluating the encoding competence of visual language models using uncommon actions [5.816389980109022]
UAIT is a new evaluation benchmark designed to test the semantic understanding ability of visual language models (VLMs) in uncommon-sense action scenes.<n>We synthesize high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation.<n>We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning.
arXiv Detail & Related papers (2026-01-12T17:15:45Z)
Understanding the Implicit Biases of Design Choices for Time Series Foundation Models [90.894232610821]
Time series foundation models (TSFMs) are a class of potentially powerful, general-purpose tools for time series forecasting and related temporal tasks.<n>Their behavior is strongly shaped by subtle inductive biases in their design.<n>We show how these biases can be intuitive or very counterintuitive, depending on properties of the model and data.
arXiv Detail & Related papers (2025-10-22T04:42:35Z)
How to Squeeze An Explanation Out of Your Model [13.154512864498912]
This paper proposes an approach for interpretability that is model-agnostic. By including an SE block prior to the classification layer of any model, we are able to retrieve the most influential features. Results show that this new SE-based interpretability can be applied to various models in image and video/multi-modal settings.
arXiv Detail & Related papers (2024-12-06T15:47:53Z)
FRoundation: Are Foundation Models Ready for Face Recognition? [8.045296450065019]
We propose and demonstrate the adaptation of foundation models for face recognition across different levels of data availability. Our results indicate that, despite their versatility, pre-trained foundation models underperform in face recognition. Fine-tuning foundation models yields promising results, often surpassing models trained from scratch when training data is limited.
arXiv Detail & Related papers (2024-10-31T11:21:21Z)
Enabling Small Models for Zero-Shot Classification through Model Label Learning [50.68074833512999]
We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities. Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL.
arXiv Detail & Related papers (2024-08-21T09:08:26Z)
Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models [0.0]
An increase in the number of connected devices around the world warrants compressed models that can be easily deployed at the local devices with low compute capacity and power accessibility. We implemented both, quantization and pruning, compression techniques on popular deep learning models used in the image classification, object detection, language models and generative models-based problem statements.
arXiv Detail & Related papers (2024-07-22T14:20:53Z)
Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images. We identify model weaknesses by testing the model using the counterfactual image dataset. We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
What matters when building vision-language models? [52.8539131958858]
We develop Idefics2, an efficient foundational vision-language model with 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks. We release the model (base, instructed, and chat) along with the datasets created for its training.
arXiv Detail & Related papers (2024-05-03T17:00:00Z)
Generalized Relevance Learning Grassmann Quantization [0.0]
A popular way to model image sets is subspaces, which form a manifold called the Grassmann manifold. We extend the application of Generalized Relevance Learning Vector Quantization to deal with Grassmann manifold. We apply it to several recognition tasks including handwritten digit recognition, face recognition, activity recognition, and object recognition.
arXiv Detail & Related papers (2024-03-14T08:53:01Z)
Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost. Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
Foundation models in brief: A historical, socio-technical focus [2.5991265608180396]
Foundation models can be disruptive for future AI development by scaling up deep learning. Models achieve state-of-the-art performance on a variety of tasks in domains such as natural language processing and computer vision.
arXiv Detail & Related papers (2022-12-17T22:11:33Z)
On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning. We evaluate four model families, OPT, BLOOM, CodeGen and Codex on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models. A collection of pretrained encoders perceive diverse modalities (such as vision, and language) We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.