Earth Observation Foundation Model PhilEO: Pretraining on the MajorTOM and FastTOM Datasets
- URL: http://arxiv.org/abs/2506.14765v3
- Date: Mon, 15 Sep 2025 10:03:43 GMT
- Title: Earth Observation Foundation Model PhilEO: Pretraining on the MajorTOM and FastTOM Datasets
- Authors: Nikolaos Dionelis, Jente Bosmans, Riccardo Musto, Giancarlo Paoletti, Simone Sarti, Giacomo Cascarano, Casper Fibaek, Luke Camilleri, Bertrand Le Saux, Nicolas Longépé,
- Abstract summary: We present the scaling-up of our recently proposed EO Foundation Model, PhilEO Geo-Aware U-Net, on the unlabeled 23TB dataset MajorTOM.<n>We fine-tune the models on the PhilEO Bench for road density estimation, building density pixel-wise regression, and land cover semantic segmentation.
- Score: 22.527733721726587
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Today, Earth Observation (EO) satellites generate massive volumes of data, with the Copernicus Sentinel-2 constellation alone producing approximately 1.6TB per day. To fully exploit this information, it is essential to pretrain EO Foundation Models (FMs) on large unlabeled datasets, enabling efficient fine-tuning for several different downstream tasks with minimal labeled data. In this work, we present the scaling-up of our recently proposed EO Foundation Model, PhilEO Geo-Aware U-Net, on the unlabeled 23TB dataset MajorTOM, which covers the vast majority of the Earth's surface, as well as on the specialized subset FastTOM 2TB that does not include oceans and ice. We develop and study various PhilEO model variants with different numbers of parameters and architectures. We fine-tune the models on the PhilEO Bench for road density estimation, building density pixel-wise regression, and land cover semantic segmentation, and we evaluate the performance. Our results demonstrate that for all n-shots for road density regression, the PhilEO 44M MajorTOM 23TB model outperforms PhilEO Globe 0.5TB 44M. We also show that for most n-shots for road density estimation and building density regression, PhilEO 200M FastTOM outperforms all the other models we examine. The effectiveness of both dataset and model scaling is validated using the PhilEO Bench. We also study the impact of architecture scaling, transitioning from U-Net Convolutional Neural Networks (CNN) to Vision Transformers (ViT).
Related papers
- Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation [0.0]
We propose a Metadata-aware Mixture-of-Experts Masked Autoencoder (MoE-MAE) with only 2.5M parameters.<n>The model combines sparse expert routing with geo-temporal conditioning, incorporating imagery alongside latitude/longitude and seasonal/daily cyclic encodings.<n>Despite its small size, the model competes with much larger architectures, demonstrating that metadata-aware pretraining improves transfer and label efficiency.
arXiv Detail & Related papers (2025-09-13T17:35:17Z) - Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources? [58.56306556151929]
Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute.<n>Can MoEs surpass dense architectures under strictly equal resource constraints?<n>We show that an MoE model with activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter, training compute and data resource.
arXiv Detail & Related papers (2025-06-13T17:59:05Z) - Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation [67.23953699167274]
Self-supervised learning (SSL) has enabled the development of vision foundation models for Earth Observation (EO)<n>In EO, this challenge is amplified by the redundancy and heavy-tailed distributions common in satellite imagery.<n>We propose a dynamic dataset pruning strategy designed to improve SSL pre-training by maximizing dataset diversity and balance.
arXiv Detail & Related papers (2025-04-09T15:13:26Z) - Distilling foundation models for robust and efficient models in digital pathology [32.99044401004595]
We distilled a large foundation model into a smaller one, reducing the number of parameters by several orders of magnitude.<n>Our model, H0-mini, achieves nearly comparable performance to large FMs at a significantly reduced inference cost.<n>It is evaluated on several public benchmarks, achieving 3rd place on the HEST benchmark and 5th place on the EVA benchmark.
arXiv Detail & Related papers (2025-01-27T17:35:39Z) - SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications.<n>Current state-of-the-art methods focus on training innovative architectural designs on confined datasets.<n>We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z) - EMOv2: Pushing 5M Vision Model Frontier [92.21687467702972]
We set up the new frontier of the 5M magnitude lightweight model on various downstream tasks.<n>Our work rethinks the lightweight infrastructure of efficient IRB and practical components in Transformer.<n>Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth, we investigate the performance upper limit of lightweight models with a magnitude of 5M.
arXiv Detail & Related papers (2024-12-09T17:12:22Z) - Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications [5.1875922375491585]
Prithvi-EO-2.0 is a new geospatial foundation model that offers significant improvements over its predecessor.<n>It is trained on 4.2M global time series samples from NASA's Harmonized Landsat and Sentinel-2 data archive at 30m resolution.<n>The 600M version outperforms the previous Prithvi-EO model by 8% across a range of tasks.
arXiv Detail & Related papers (2024-12-03T17:59:50Z) - Phikon-v2, A large and public feature extractor for biomarker prediction [42.52549987351643]
We train a vision transformer using DINOv2 and publicly release one iteration of this model for further experimentation, coined Phikon-v2.
While trained on publicly available histology slides, Phikon-v2 surpasses our previously released model (Phikon) and performs on par with other histopathology foundation models (FM) trained on proprietary data.
arXiv Detail & Related papers (2024-09-13T20:12:29Z) - SpectralEarth: Training Hyperspectral Foundation Models at Scale [47.93167977587301]
We introduce SpectralEarth, a large-scale multi-temporal dataset designed to pretrain hyperspectral foundation models.
We pretrain a series of foundation models on SpectralEarth using state-of-the-art self-supervised learning (SSL) algorithms.
We construct four downstream datasets for land-cover and crop-type mapping, providing benchmarks for model evaluation.
arXiv Detail & Related papers (2024-08-15T22:55:59Z) - An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question if the textitextremely simple lightweight ViTs' fine-tuning performance can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design ($5.7M$/$6.5M$) can achieve $79.4%$/$78.9%$ top-1 accuracy on ImageNet-1
arXiv Detail & Related papers (2024-04-18T14:14:44Z) - Pretraining Billion-scale Geospatial Foundational Models on Frontier [0.16492989697868893]
Foundation Models (FMs) are trained with internet-scale unlabeled data via self-supervised learning.
We investigate billion scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data.
Our larger 3B parameter size model achieves up to 30% improvement in top1 scene classification accuracy.
arXiv Detail & Related papers (2024-04-17T19:16:32Z) - No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance [68.18779562801762]
multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance.
Our study reveals an exponential need for training data which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
arXiv Detail & Related papers (2024-04-04T17:58:02Z) - Proprioception Is All You Need: Terrain Classification for Boreal Forests [4.703814941476968]
BorealTC is a publicly available dataset for proprioceptive-based terrain classification (TC)
Recorded with a Husky A200, our dataset contains 116 min of Inertial Measurement Unit (IMU), motor current, and wheel odometry data.
We show that the combination of two TC datasets yields a latent space that can be interpreted with the properties of the terrains.
arXiv Detail & Related papers (2024-03-25T15:42:09Z) - MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
transferring the pretrained models to downstream tasks may encounter task discrepancy due to their formulation of pretraining as image classification or object discrimination tasks.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z) - PhilEO Bench: Evaluating Geo-Spatial Foundation Models [30.02962498304698]
This paper introduces the PhilEO Bench, a novel evaluation framework for EO Foundation Models.
The framework comprises of a testbed and a novel 400 GB Sentinel-2 dataset.
We present experiments using our framework evaluating different Foundation Models, including Prithvi and SatMAE.
arXiv Detail & Related papers (2024-01-09T09:58:42Z) - Foundation Models for Generalist Geospatial Artificial Intelligence [3.7002058945990415]
This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive data.
We have utilized this framework to create Prithvi, a transformer-based foundational model pre-trained on more than 1TB of multispectral satellite imagery.
arXiv Detail & Related papers (2023-10-28T10:19:55Z) - SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation [83.18930314027254]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications.
In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X) with up to ViT-Huge as the backbone.
With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments.
arXiv Detail & Related papers (2023-09-29T17:58:06Z) - MIMIC: Masked Image Modeling with Image Correspondences [29.8154890262928]
Current methods for building effective pretraining datasets rely on annotated 3D meshes, point clouds, and camera parameters from simulated environments.
We propose a pretraining dataset-curation approach that does not require any additional annotations.
Our method allows us to generate multi-view datasets from both real-world videos and simulated environments at scale.
arXiv Detail & Related papers (2023-06-27T00:40:12Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - DIME-FM: DIstilling Multimodal and Efficient Foundation Models [72.1900621000677]
Large Vision-Language Foundation Models (VLFM) are trained on large-scale datasets of image-caption pairs.
We introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models.
The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset.
arXiv Detail & Related papers (2023-03-31T17:47:23Z) - Transforming Model Prediction for Tracking [109.08417327309937]
Transformers capture global relations with little inductive bias, allowing it to learn the prediction of more powerful target models.
We train the proposed tracker end-to-end and validate its performance by conducting comprehensive experiments on multiple tracking datasets.
Our tracker sets a new state of the art on three benchmarks, achieving an AUC of 68.5% on the challenging LaSOT dataset.
arXiv Detail & Related papers (2022-03-21T17:59:40Z) - Efficient deep learning models for land cover image classification [0.29748898344267777]
This work experiments with the BigEarthNet dataset for land use land cover (LULC) image classification.
We benchmark different state-of-the-art models, including Convolution Neural Networks, Multi-Layer Perceptrons, Visual Transformers, EfficientNets and Wide Residual Networks (WRN)
Our proposed lightweight model has an order of magnitude less trainable parameters, achieves 4.5% higher averaged f-score classification accuracy for all 19 LULC classes and is trained two times faster with respect to a ResNet50 state-of-the-art model that we use as a baseline.
arXiv Detail & Related papers (2021-11-18T00:03:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.