Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and Planetscope for end-users
- URL: http://arxiv.org/abs/2511.01990v2
- Date: Thu, 06 Nov 2025 02:22:11 GMT
- Title: Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and Planetscope for end-users
- Authors: Saurabh Kaushik, Lalit Maurya, Elizabeth Tellman, ZhiJie Zhang,
- Abstract summary: Geo-Foundational Models (GFMs) enable fast and reliable extraction of information from satellite imagery. Despite their potential, it remains unclear whether GFMs outperform traditional models like U-Net. We evaluate four GFMs, Prithvi 2.0, Clay V1.5, DOFA, and UViT, against TransNorm, U-Net, and Attention U-Net.
- Score: 1.3877194435621216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Geo-Foundational Models (GFMs) enable fast and reliable extraction of spatiotemporal information from satellite imagery, improving flood inundation mapping by leveraging location and time embeddings. Despite their potential, it remains unclear whether GFMs outperform traditional models like U-Net. A systematic comparison across sensors and data availability scenarios is still lacking, which is an essential step to guide end-users in model selection. To address this, we evaluate four GFMs, Prithvi 2.0, Clay V1.5, DOFA, and UViT (a Prithvi variant), against TransNorm, U-Net, and Attention U-Net using PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance among all GFMs, with only 2-5% variation between the best and worst models across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In leave-one-region-out cross-validation across five regions, Clay shows slightly better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07), 0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA (0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and Sentinel-1, respectively. Across all 19 sites, leave-one-region-out cross-validation reveals a 4% improvement by Clay compared to U-Net. Visual inspection highlights Clay's superior ability to retain fine details. Few-shot experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational time, Clay is a better choice due to its smaller model size (26M parameters), making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M). Contrary to previous findings, our results suggest GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net.
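The mIoU figures quoted in the abstract are per-class intersection-over-union averaged across classes. A minimal sketch of that metric for a binary flood mask (not the authors' implementation; label names and the toy masks are illustrative):

```python
# Minimal mIoU sketch for binary flood mapping (0 = dry, 1 = flooded).
# Masks are flat sequences of class labels; real pipelines would
# flatten raster arrays the same way.

def miou(pred, truth, classes=(0, 1)):
    """Mean intersection-over-union averaged over the given classes."""
    ious = []
    for c in classes:
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:  # skip classes absent from both prediction and truth
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [1, 1, 0, 0, 1, 0]   # model output (toy example)
truth = [1, 0, 0, 0, 1, 1]  # reference flood mask (toy example)
print(miou(pred, truth))    # → 0.5 (IoU is 0.5 for each class here)
```

Libraries such as torchmetrics or scikit-learn provide equivalent batched implementations; the point here is only what the reported numbers measure.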
Related papers
- Time-Series at the Edge: Tiny Separable CNNs for Wearable Gait Detection and Optimal Sensor Placement [3.7765281299298015]
We study on-device time-series analysis for gait detection in Parkinson's disease (PD) from short windows of triaxial acceleration, targeting resource- and latency-constrained wearables and edge nodes. We compare magnitude thresholding to three 1D CNNs for time-series analysis: a literature baseline (separable convolutions) and two ultra-light models, one purely separable and one with residual connections.
arXiv Detail & Related papers (2025-11-29T08:52:41Z) - LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer [14.684808109822386]
LC4-DViT is a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions to synthesize high-fidelity training images. DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context.
arXiv Detail & Related papers (2025-11-27T23:56:35Z) - Habitat and Land Cover Change Detection in Alpine Protected Areas: A Comparison of AI Architectures [0.0]
We employ deep learning for change detection using long-term alpine habitat data from Gesaeuse National Park, Austria. Clay v1.0 achieves 51% overall accuracy versus U-Net's 41% for multi-class habitat change, while both reach 67% for binary change detection.
arXiv Detail & Related papers (2025-10-29T12:32:28Z) - Cross-View UAV Geo-Localization with Precision-Focused Efficient Design: A Hierarchical Distillation Approach with Multi-view Refinement [47.16612614191333]
Cross-view geo-localization (CVGL) enables UAV localization by matching aerial images to geo-tagged satellite databases. Existing methods rely on resource-intensive fine-grained feature extraction and alignment. We propose Precision-Focused Efficient Design (PFED), a resource-efficient framework combining hierarchical knowledge transfer and multi-view representation refinement.
arXiv Detail & Related papers (2025-10-26T08:47:20Z) - Toward Onboard AI-Enabled Solutions to Space Object Detection for Space Sustainability [29.817805350971366]
This paper investigates the feasibility and effectiveness of employing vision sensors for space object detection. It introduces models based on the Squeeze-and-Excitation (SE) layer, Vision Transformer (ViT), and the Generalized Efficient Layer Aggregation Network (GELAN). Experimental results show that the proposed models achieve mean average precision at an intersection-over-union threshold of 0.5 (mAP50) of up to 0.751 and mean average precision averaged over intersection-over-union thresholds from 0.5 to 0.95 (mAP50:95) of up to 0.280.
arXiv Detail & Related papers (2025-05-03T01:56:52Z) - Time Frequency Analysis of EMG Signal for Gesture Recognition using Fine grained Features [3.9440964696313485]
This paper proposes a novel approach to EMG-based hand gesture recognition that uses fine-grained classification. XMANet unifies low-level local and high-level semantic cues through cross-layer mutual attention among shallow-to-deep CNN experts.
arXiv Detail & Related papers (2025-04-20T18:51:10Z) - STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision [3.671692919685993]
We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective observations into global map perspective representations. We evaluate these models across two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan.
arXiv Detail & Related papers (2025-03-11T00:38:54Z) - Automating global landslide detection with heterogeneous ensemble deep-learning classification [44.99833362998488]
Landslides threaten infrastructure, including roads, railways, buildings, and human life.
Hazard-based spatial planning and early warning systems are cost-effective strategies to reduce the risk to society from landslides.
Deep learning models have recently been applied for landslide mapping using medium- to high-resolution satellite images as input.
arXiv Detail & Related papers (2023-09-12T10:56:16Z) - Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning [79.43940012723539]
ADCLR is a self-supervised learning framework for learning accurate and dense vision representation.
Our approach achieves new state-of-the-art performance for contrastive methods.
arXiv Detail & Related papers (2023-06-23T07:38:09Z) - Vision Transformers, a new approach for high-resolution and large-scale mapping of canopy heights [50.52704854147297]
We present a new vision transformer (ViT) model optimized with a classification (discrete) and a continuous loss function.
This model achieves better accuracy than previously used convolutional based approaches (ConvNets) optimized with only a continuous loss function.
arXiv Detail & Related papers (2023-04-22T22:39:03Z) - ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention [48.697458429460184]
Two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer.
This paper proposes a well-designed model named ERNIE-Sparse.
It consists of two distinctive parts: (i) Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) Self-Attention Regularization (SAR) to minimize the distance for transformers with different attention topologies.
arXiv Detail & Related papers (2022-03-23T08:47:01Z) - Continental-scale land cover mapping at 10 m resolution over Europe (ELC10) [0.0]
We present a high resolution (10 m) land cover map (ELC10) of Europe based on a satellite-driven machine learning workflow.
A Random Forest classification model was trained on 70K ground-truth points from the LUCAS dataset.
The map achieved an overall accuracy of 90% across 8 land cover classes and estimated land cover proportions within statistical units to within 3.9%.
arXiv Detail & Related papers (2021-04-22T08:24:15Z) - Neural Network Virtual Sensors for Fuel Injection Quantities with Provable Performance Specifications [71.1911136637719]
We show how provable guarantees can be naturally applied to other real world settings.
We show how specific intervals of fuel injection quantities can be targeted to maximize robustness for certain ranges.
arXiv Detail & Related papers (2020-06-30T23:33:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.