Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation
- URL: http://arxiv.org/abs/2503.15969v1
- Date: Thu, 20 Mar 2025 09:13:31 GMT
- Title: Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation
- Authors: Clive Tinashe Marimo, Benedikt Blumenstiel, Maximilian Nitsche, Johannes Jakubik, Thomas Brunschwiler
- Abstract summary: We introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset. We present the largest-to-date image-caption dataset for multispectral data, consisting of one million Sentinel-2 samples. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity.
- Score: 3.4719449211802456
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision-language models for Earth observation (EO) typically rely on the visual spectrum of data as the only model input, thus failing to leverage the rich spectral information available in the multispectral channels recorded by satellites. Therefore, in this paper, we introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset and report on the performance gains due to the extended spectral range. Furthermore, we present the largest-to-date image-caption dataset for multispectral data, consisting of one million Sentinel-2 samples and corresponding textual descriptions generated with Llama3-LLaVA-Next and Overture Maps data. We develop a scalable captioning pipeline, which is validated by domain experts. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity. Our results demonstrate that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by 6.77% on average and retrieval performance by 4.63% mAP compared to the second-best model. Our results emphasize the relevance of multispectral vision-language learning. We release the image-caption dataset, code, and model weights under an open-source license.
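The training recipe described above is CLIP-style contrastive pre-training, with the image tower adapted to ingest Sentinel-2 multispectral bands instead of RGB only. The sketch below illustrates that idea in PyTorch; the encoder architecture, band count (12), and module names are illustrative assumptions for this page, not the released Llama3-MS-CLIP implementation.

```python
# Minimal sketch of CLIP-style contrastive pre-training on multispectral imagery.
# All module names and shapes are illustrative assumptions (12 Sentinel-2 bands,
# toy encoders), not the actual Llama3-MS-CLIP code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultispectralEncoder(nn.Module):
    """Toy image tower whose first conv accepts 12 spectral bands instead of 3."""

    def __init__(self, n_bands: int = 12, embed_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_bands, 64, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def clip_contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss over matching image-caption pairs in a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))         # diagonal entries are the true pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    batch, bands, size, dim = 8, 12, 224, 512
    image_tower = MultispectralEncoder(bands, dim)
    images = torch.randn(batch, bands, size, size)  # stand-in Sentinel-2 patches
    text_emb = torch.randn(batch, dim)              # stand-in for the text-tower output
    loss = clip_contrastive_loss(image_tower(images), text_emb)
    print(f"contrastive loss: {loss.item():.4f}")
```

At inference time, the zero-shot evaluation reported in the abstract reuses the same embedding space: class names are encoded by the text tower and each multispectral image is assigned to the class whose text embedding has the highest cosine similarity.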
Related papers
- LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes [55.33167217384738]
LiMoE is a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning.
Our approach consists of three stages: Image-to-LiDAR Pretraining, Contrastive Mixture Learning (CML), and Semantic Mixture Supervision (SMS).
arXiv Detail & Related papers (2025-01-07T18:59:58Z)
- SpectralEarth: Training Hyperspectral Foundation Models at Scale [47.93167977587301]
We introduce SpectralEarth, a large-scale multi-temporal dataset designed to pretrain hyperspectral foundation models.
We pretrain a series of foundation models on SpectralEarth using state-of-the-art self-supervised learning (SSL) algorithms.
We construct four downstream datasets for land-cover and crop-type mapping, providing benchmarks for model evaluation.
arXiv Detail & Related papers (2024-08-15T22:55:59Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest-ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and hierarchical label generation.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z)
- Learning transformer-based heterogeneously salient graph representation for multimodal remote sensing image classification [42.15709954199397]
A transformer-based heterogeneously salient graph representation (THSGR) approach is proposed in this paper.
First, a multimodal heterogeneous graph encoder is presented to encode distinctively non-Euclidean structural features from heterogeneous data.
A self-attention-free multi-convolutional modulator is designed for effective and efficient long-term dependency modeling.
arXiv Detail & Related papers (2023-11-17T04:06:20Z)
- SpectralGPT: Spectral Remote Sensing Foundation Model [60.023956954916414]
A universal RS foundation model, named SpectralGPT, is purpose-built to handle spectral RS images using a novel 3D generative pretrained transformer (GPT).
Compared to existing foundation models, SpectralGPT accommodates input images with varying sizes, resolutions, time series, and regions in a progressive training fashion, enabling full utilization of extensive RS big data.
Our evaluation highlights significant performance improvements with pretrained SpectralGPT models, signifying substantial potential in advancing spectral RS big data applications within the field of geoscience.
arXiv Detail & Related papers (2023-11-13T07:09:30Z)
- ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data [8.905439446173503]
Vision-language models (VLMs) are generally trained on datasets consisting of image-caption pairs obtained from the web.
Real-world multimodal datasets, such as healthcare data, are significantly more complex.
ViLLA is trained to capture fine-grained region-attribute relationships from complex datasets.
arXiv Detail & Related papers (2023-08-22T05:03:09Z)
- A generic self-supervised learning (SSL) framework for representation learning from spectra-spatial feature of unlabeled remote sensing imagery [4.397725469518669]
Self-supervised learning (SSL) enables models to learn representations from orders of magnitude more unlabelled data.
This work designs a novel SSL framework capable of learning representations from both the spectral and spatial information of unlabelled data.
arXiv Detail & Related papers (2023-06-27T23:50:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.