Fewshot learning on global multimodal embeddings for earth observation tasks
- URL: http://arxiv.org/abs/2310.00119v2
- Date: Sun, 3 Dec 2023 00:14:20 GMT
- Title: Fewshot learning on global multimodal embeddings for earth observation tasks
- Authors: Matt Allen, Francisco Dorr, Joseph A. Gallego-Mejia, Laura Martínez-Ferrer, Anna Jungbluth, Freddie Kalaitzis, Raúl Ramos-Pollán
- Abstract summary: We pretrain a CLIP/ViT-based model using three different modalities of satellite imagery covering over 10% of Earth's total landmass.
We use the embeddings produced for each modality with a classical machine learning method to attempt different downstream tasks for earth observation.
We visually show that this embedding space, obtained with no labels, is sensitive to the different earth features represented by the labeled datasets we selected.
- Score: 5.057850174013128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work we pretrain a CLIP/ViT-based model using three different
modalities of satellite imagery across five areas of interest (AOIs) covering over
10% of Earth's total landmass, namely Sentinel-2 RGB optical imagery, Sentinel-1 SAR
amplitude and Sentinel-1 interferometric coherence. The model uses ~250M parameters.
We then use the embeddings produced for each modality with a classical machine
learning method to tackle different downstream earth observation tasks related to
vegetation, built-up surface, croplands and permanent water. We consistently show
that the need for labeled data is reduced by 99%: with ~200-500 randomly selected
labeled examples (around 4K-10K km²) we reach performance levels analogous to those
achieved with the full labeled datasets (about 150K image chips or 3M km² per AOI)
on all modalities, AOIs and downstream tasks. This leads us to think that the model
has captured significant earth features useful in a wide variety of scenarios. To
enhance the model's usability in practice, its architecture allows inference even
when some modalities, or some channels within a modality, are missing. Additionally,
we visually show that this embedding space, obtained with no labels, is sensitive to
the different earth features represented by the labeled datasets we selected.
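To make the few-shot protocol concrete, the minimal sketch below fits a classical classifier on a few hundred precomputed embeddings and compares it with training on the full labeled set. The file names are hypothetical exports of the pretrained encoder's per-chip embeddings and binarized labels, and logistic regression merely stands in for the unspecified classical machine learning method.

```python
# Minimal few-shot evaluation sketch on precomputed embeddings.
# Assumptions: embeddings/labels were exported elsewhere (hypothetical file
# names); logistic regression stands in for the "classical ML method".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

embeddings = np.load("embeddings_s2_rgb.npy")   # (n_chips, embed_dim)
labels = np.load("labels_vegetation.npy")       # (n_chips,) binary chip labels

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)

# Few-shot regime: ~200-500 randomly selected labeled chips.
rng = np.random.default_rng(42)
few = rng.choice(len(y_train), size=500, replace=False)

clf_few = LogisticRegression(max_iter=1000).fit(X_train[few], y_train[few])
clf_full = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("few-shot (500 labels):", accuracy_score(y_test, clf_few.predict(X_test)))
print("full labeled set:     ", accuracy_score(y_test, clf_full.predict(X_test)))
```

In the paper's setting this comparison is repeated for every modality, AOI and downstream task.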
Related papers
- MTGS: Multi-Traversal Gaussian Splatting [51.22657444433942]
Multi-traversal data provides multiple viewpoints for scene reconstruction within a road block.
We propose Multi-Traversal Gaussian Splatting (MTGS), a novel approach that reconstructs high-quality driving scenes from arbitrarily collected multi-traversal data.
Our results demonstrate that MTGS improves LPIPS by 23.5% and geometry accuracy by 46.3% compared to single-traversal baselines.
arXiv Detail & Related papers (2025-03-16T15:46:12Z)
- EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision [72.84868704100595]
This paper presents a dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks.
The dataset spans 15 terapixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1 m spatial resolution data from Satellogic.
Accompanying the dataset is EarthMAE, a tailored Masked Autoencoder developed to tackle the distinct challenges of remote sensing data.
arXiv Detail & Related papers (2025-01-14T13:42:22Z)
- AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities [5.767156832161819]
We propose AnySat, a multimodal model based on a joint embedding predictive architecture (JEPA) and scale-adaptive spatial encoders.
To demonstrate the advantages of this unified approach, we compile GeoPlex, a collection of 5 multimodal datasets.
We then train a single powerful model on these diverse datasets simultaneously.
arXiv Detail & Related papers (2024-12-18T18:11:53Z)
- SpectralEarth: Training Hyperspectral Foundation Models at Scale [47.93167977587301]
We introduce SpectralEarth, a large-scale multi-temporal dataset designed to pretrain hyperspectral foundation models.
We pretrain a series of foundation models on SpectralEarth using state-of-the-art self-supervised learning (SSL) algorithms.
We construct four downstream datasets for land-cover and crop-type mapping, providing benchmarks for model evaluation.
arXiv Detail & Related papers (2024-08-15T22:55:59Z)
- MLMT-CNN for Object Detection and Segmentation in Multi-layer and Multi-spectral Images [4.2623421577291225]
We present a multi-task deep learning framework that exploits the dependencies between image bands to produce 3D AR localisation.
Our framework achieves an average of 0.72 IoU (segmentation) and 0.90 F1 score (detection) across all modalities.
arXiv Detail & Related papers (2024-07-19T17:21:53Z)
- M3LEO: A Multi-Modal, Multi-Label Earth Observation Dataset Integrating Interferometric SAR and Multispectral Data [1.4053129774629076]
M3LEO is a multi-modal, multi-label Earth observation dataset.
It spans approximately 17M 4x4 km data chips from six diverse geographic regions.
arXiv Detail & Related papers (2024-06-06T16:30:41Z)
- OmniSat: Self-Supervised Modality Fusion for Earth Observation [5.767156832161819]
We introduce OmniSat, a novel architecture able to merge diverse EO modalities into expressive features without labels.
As demonstrated on three downstream tasks, OmniSat can learn rich representations without supervision, leading to state-of-the-art performance.
Our multimodal pretraining scheme improves performance even when only one modality is available for inference.
arXiv Detail & Related papers (2024-04-12T09:31:55Z)
- Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker of a single set of parameters for any modality.
To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques.
Our Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset while introducing only +2.14 GFLOPs (over 21.50) and +6.6M parameters (over 93M).
arXiv Detail & Related papers (2023-11-27T14:17:41Z)
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Sketch and Scale: Geo-distributed tSNE and UMAP [75.44887265789056]
Running machine learning analytics over geographically distributed datasets is a rapidly arising problem.
We introduce a novel framework: Sketch and Scale (SnS).
It leverages a Count Sketch data structure to compress the data on the edge nodes, aggregates the reduced size sketches on the master node, and runs vanilla tSNE or UMAP on the summary.
We show this technique to be fully parallel, to scale linearly in time and logarithmically in memory and communication, making it possible to analyze datasets with many millions, potentially billions, of data points spread across several data centers around the globe.
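As a minimal illustration of why this aggregation works (not the SnS implementation itself), the sketch below builds a Count Sketch summary of a toy data matrix and checks that sketches computed on separate partitions sum to the sketch of the full data; all sizes, hash parameters and data are made up for the example.

```python
# Illustration only: Count Sketch summaries are linear, so per-node sketches
# can simply be summed on a master node. Hash parameters and sizes are made up.
import numpy as np

def count_sketch(X, row_ids, width, seed=0):
    rng = np.random.default_rng(seed)              # shared seed -> compatible sketches
    p = 2_147_483_647                              # large prime for the hash functions
    a, b, c, d = rng.integers(1, p, size=4)
    buckets = ((a * row_ids + b) % p) % width      # bucket index per row
    signs = 2 * (((c * row_ids + d) % p) % 2) - 1  # +/-1 sign per row
    sketch = np.zeros((width, X.shape[1]))
    np.add.at(sketch, buckets, signs[:, None] * X) # accumulate signed rows per bucket
    return sketch

X = np.random.randn(1000, 8)                       # toy dataset split across two "nodes"
ids = np.arange(1000)
whole = count_sketch(X, ids, width=64)
merged = count_sketch(X[:600], ids[:600], 64) + count_sketch(X[600:], ids[600:], 64)
assert np.allclose(whole, merged)                  # merging sketches == sketching the union
# A master node could then run vanilla t-SNE/UMAP on such an aggregated summary.
```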
arXiv Detail & Related papers (2020-11-11T22:32:21Z)
- Campus3D: A Photogrammetry Point Cloud Benchmark for Hierarchical Understanding of Outdoor Scene [76.4183572058063]
We present a richly-annotated 3D point cloud dataset for multiple outdoor scene understanding tasks.
The dataset has been annotated point-wise with both hierarchical and instance-based labels.
We formulate a hierarchical learning problem for 3D point cloud segmentation and propose a measurement evaluating consistency across various hierarchies.
arXiv Detail & Related papers (2020-08-11T19:10:32Z)
- A Nearest Neighbor Network to Extract Digital Terrain Models from 3D Point Clouds [1.6249267147413524]
We present an algorithm that operates on 3D-point clouds and estimates the underlying DTM for the scene using an end-to-end approach.
Our model learns neighborhood information and seamlessly integrates this with point-wise and block-wise global features.
arXiv Detail & Related papers (2020-05-21T15:54:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.