SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
- URL: http://arxiv.org/abs/2312.10115v2
- Date: Fri, 22 Mar 2024 16:46:36 GMT
- Title: SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
- Authors: Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingxiang Hu, Huimei He, Jian Wang, Jingdong Chen, Ming Yang, Yongjun Zhang, Yansheng Li
- Abstract summary: We present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing dataset with 21.5 million temporal sequences.
To the best of our knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior studies on Remote Sensing Foundation Models (RSFMs) reveal immense potential towards a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks. In this study, we present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained with our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations with geo-context clues, we introduce Geo-Context Prototype Learning to learn region-aware prototypes upon RSI's multi-modal spatiotemporal features. To the best of our knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities on a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. Specifically, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and 3.61% on average respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.
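The abstract names two pre-training objectives but includes no code. As a rough illustration only, the following minimal PyTorch-style sketch shows what a multi-granularity contrastive objective between the optical and SAR branches could look like; the function names and the pixel/object/image feature interface are assumptions for this sketch, not SkySense's released API.

```python
# Hypothetical sketch of multi-granularity contrastive learning between an
# optical branch and a SAR branch. Names and shapes are illustrative only.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE loss: matching rows of a and b are positive pairs."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                       # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def multi_granularity_loss(opt_feats: dict, sar_feats: dict) -> torch.Tensor:
    """Contrast the two modalities at several spatial granularities.

    opt_feats / sar_feats: dicts with keys 'pixel', 'object', 'image',
    each mapping to aligned (N, D) feature matrices (an assumed interface).
    """
    levels = ("pixel", "object", "image")
    loss = sum(info_nce(opt_feats[k], sar_feats[k]) for k in levels)
    return loss / len(levels)
```

Geo-Context Prototype Learning would additionally maintain region-aware prototype vectors that the multi-modal spatiotemporal features are assigned to; that component is omitted from this sketch.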
Related papers
- MLMT-CNN for Object Detection and Segmentation in Multi-layer and Multi-spectral Images
We present a multi-task deep learning framework that exploits the dependencies between image bands to produce 3D active region (AR) localisation.
Our framework achieves an average of 0.72 IoU (segmentation) and 0.90 F1 score (detection) across all modalities.
arXiv Detail & Related papers (2024-07-19T17:21:53Z)
- MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning
We propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images.
Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE).
We find that multi-modal pretraining notably improves linear probing performance, e.g., by 4 percentage points on BigEarthNet and 16 on So2Sat (a sketch of the multi-pretext idea follows this entry).
arXiv Detail & Related papers (2024-05-04T23:16:48Z)
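The MMEarth summary mentions multiple pretext tasks on top of a fully convolutional MAE. Below is a minimal sketch of that pattern, assuming one shared encoder and several lightweight reconstruction heads; the head names, interface, and equal loss weighting are assumptions rather than MMEarth's actual configuration.

```python
# Sketch of a multi-pretext masked autoencoder: one shared encoder, several
# decoders, each reconstructing a different target on the masked positions.
import torch
import torch.nn as nn

class MultiPretextMAE(nn.Module):
    def __init__(self, encoder: nn.Module, heads: dict):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict(heads)  # e.g. {"pixels": ..., "elevation": ...}

    def forward(self, masked_input, targets: dict, mask):
        latent = self.encoder(masked_input)
        loss = 0.0
        for name, head in self.heads.items():
            pred = head(latent)
            err = (pred - targets[name]) ** 2              # squared error per element
            loss = loss + (err * mask).sum() / mask.sum()  # masked positions only
        return loss / len(self.heads)                      # equal weighting (assumed)
```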
- Spatio-Temporal SwinMAE: A Swin Transformer based Multiscale Representation Learner for Temporal Satellite Imagery
This paper presents Spatio-Temporal SwinMAE (ST-SwinMAE), an architecture which particularly focuses on representation learning for temporal image processing.
We present a pretrained model named Degas 100M as a geospatial foundation model.
We also propose an approach for transfer learning with Degas 100M in which both the pretrained encoder and the pretrained decoder of the MAE are utilized (sketched after this entry).
Our approach shows significant performance improvements over existing state-of-the-art foundation models.
arXiv Detail & Related papers (2024-05-03T22:55:56Z)
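The distinctive point in the ST-SwinMAE entry is that transfer learning reuses the pretrained decoder as well as the encoder, rather than discarding the decoder as encoder-only transfer usually does. A minimal sketch of that idea, assuming a pretrained model object exposing encoder and decoder attributes (hypothetical names):

```python
# Sketch: keep both pretrained MAE halves and add a small dense task head.
import torch.nn as nn

class PretrainedTransfer(nn.Module):
    def __init__(self, pretrained_mae: nn.Module, dim: int, num_classes: int):
        super().__init__()
        self.encoder = pretrained_mae.encoder   # reused pretrained encoder
        self.decoder = pretrained_mae.decoder   # decoder reused too (the key idea)
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x):
        # No masking at fine-tuning time: the full image passes through both
        # halves, then a 1x1 conv head produces a dense prediction map.
        return self.head(self.decoder(self.encoder(x)))
```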
- OmniSat: Self-Supervised Modality Fusion for Earth Observation
We introduce OmniSat, a novel architecture able to merge diverse Earth Observation (EO) modalities into expressive features without labels.
As demonstrated on three downstream tasks, OmniSat can learn rich representations without supervision, leading to state-of-the-art performance.
Our multimodal pretraining scheme improves performance even when only one modality is available for inference.
arXiv Detail & Related papers (2024-04-12T09:31:55Z)
- SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation
We explore the potential of generative image diffusion to address the scarcity of annotated data in Earth observation tasks.
To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation (one possible joint-generation scheme is sketched after this entry).
arXiv Detail & Related papers (2024-03-25T10:30:22Z)
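The SatSynth summary states that images and masks are generated jointly but does not spell out the mechanism here. One common realization, which may differ from SatSynth's actual design, is to stack a one-hot label mask with the image as extra channels and train a single diffusion model on the joint tensor:

```python
# Hypothetical joint image-and-mask representation for a diffusion model.
import torch
import torch.nn.functional as F

def to_joint(image: torch.Tensor, mask: torch.Tensor, num_classes: int):
    """image: (B, 3, H, W) scaled to [-1, 1]; mask: (B, H, W) integer labels."""
    onehot = F.one_hot(mask, num_classes)                 # (B, H, W, C)
    onehot = onehot.permute(0, 3, 1, 2).float() * 2 - 1   # rescale to [-1, 1]
    return torch.cat([image, onehot], dim=1)              # (B, 3 + C, H, W)

def split_joint(x: torch.Tensor, num_classes: int):
    """Recover an (image, hard label map) pair from a generated joint sample."""
    image, mask_logits = x[:, :-num_classes], x[:, -num_classes:]
    return image, mask_logits.argmax(dim=1)
```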
- Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data.
In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized across multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z)
- Diffusion Models for Interferometric Satellite Aperture Radar
Probabilistic Diffusion Models (PDMs) have recently emerged as a promising class of generative models.
Here, we leverage PDMs to generate several radar-based satellite image datasets.
We show that PDMs succeed in generating images with complex and realistic structures, but that sampling time remains an issue (the generic sampling loop after this entry shows why).
arXiv Detail & Related papers (2023-08-31T16:26:17Z)
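To make the sampling-time remark concrete: ancestral DDPM sampling calls the denoising network once per timestep, typically hundreds to thousands of sequential calls per generated image. The loop below is the standard textbook DDPM sampler, not the paper's own code:

```python
# Generic DDPM ancestral sampling: T sequential denoising steps.
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas: torch.Tensor) -> torch.Tensor:
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                          # start from pure noise
    for t in reversed(range(len(betas))):           # one network call per step
        eps = model(x, t)                           # predicted noise
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise     # beta_t used as variance
    return x
```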
- SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery
We present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on the Masked Autoencoder (MAE).
To leverage temporal information, we include a temporal embedding and mask image patches independently across time (both mechanisms are sketched after this entry).
arXiv Detail & Related papers (2022-07-17T01:35:29Z)
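The two SatMAE mechanisms named above, a temporal embedding and masking patches independently at each timestep, can be sketched as follows; tensor shapes and function names are assumptions for illustration, not SatMAE's released implementation.

```python
# Illustrative versions of the two temporal mechanisms.
import torch

def add_temporal_embedding(tokens: torch.Tensor, t_embed: torch.Tensor):
    """tokens: (B, T, N, D) patch tokens; t_embed: (T, D) per-timestep code."""
    return tokens + t_embed[None, :, None, :]       # broadcast over batch, patches

def independent_temporal_mask(B: int, T: int, N: int, mask_ratio: float = 0.75):
    """Keep a different random subset of the N patches at every timestep."""
    keep = int(N * (1 - mask_ratio))
    scores = torch.rand(B, T, N)                    # fresh randomness per timestep
    return scores.argsort(dim=-1)[..., :keep]       # (B, T, keep) kept indices
```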
- SEnSeI: A Deep Learning Module for Creating Sensor Independent Cloud Masks
We introduce a novel neural network architecture, the Spectral ENcoder for SEnsor Independence (SEnSeI).
We focus on the problem of cloud masking, using several pre-existing datasets and a new, freely available dataset for Sentinel-2.
Our model achieves state-of-the-art performance on the satellites it was trained on (Sentinel-2 and Landsat 8) and extrapolates to sensors it has not seen during training, such as Landsat 7, PerúSat-1, and Sentinel-3 SLSTR (one way to achieve such sensor independence is sketched after this entry).
arXiv Detail & Related papers (2021-11-16T10:47:10Z)
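Sensor independence requires mapping an arbitrary number and ordering of spectral bands to a fixed-size representation. The sketch below shows one way to do that, embedding each band together with descriptive metadata (e.g. central wavelength) and pooling with a permutation-invariant maximum; the interface is an assumption, not SEnSeI's actual one.

```python
# Hypothetical sensor-independent spectral encoder.
import torch
import torch.nn as nn

class SpectralEncoder(nn.Module):
    def __init__(self, d_meta: int, d_out: int):
        super().__init__()
        self.band_mlp = nn.Sequential(
            nn.Linear(1 + d_meta, d_out), nn.ReLU(), nn.Linear(d_out, d_out))

    def forward(self, bands: torch.Tensor, band_meta: torch.Tensor):
        """bands: (B, C, H, W) reflectances; band_meta: (C, d_meta) descriptors."""
        B, C, H, W = bands.shape
        x = bands.permute(0, 2, 3, 1).unsqueeze(-1)          # (B, H, W, C, 1)
        meta = band_meta.expand(B, H, W, C, -1)              # broadcast metadata
        feats = self.band_mlp(torch.cat([x, meta], dim=-1))  # per-band features
        return feats.max(dim=3).values                       # order-invariant pooling
```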
- Modeling Shared Responses in Neuroimaging Studies through MultiView ICA
Group studies involving large cohorts of subjects are important for drawing general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, in which data from each subject are modeled as a linear combination of shared independent sources plus noise (written out after this entry).
We demonstrate the usefulness of our approach first on fMRI data, where our model shows improved sensitivity in identifying common sources across subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)
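The MultiView ICA generative model stated above, written out: each subject i observes x_i = A_i s + n_i, where the independent sources s are shared across subjects, A_i is a subject-specific mixing matrix, and n_i is noise. A tiny NumPy simulation under these assumptions:

```python
# Simulate group data from the shared-sources model x_i = A_i @ s + n_i.
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_sensors, n_samples, n_subjects = 4, 10, 1000, 3

# One set of non-Gaussian (Laplacian) sources shared by the whole group.
s = rng.laplace(size=(n_sources, n_samples))

views = []
for _ in range(n_subjects):
    A_i = rng.normal(size=(n_sensors, n_sources))        # subject-specific mixing
    n_i = 0.1 * rng.normal(size=(n_sensors, n_samples))  # subject-specific noise
    views.append(A_i @ s + n_i)                          # observed data for subject i
```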