SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology
- URL: http://arxiv.org/abs/2504.18256v1
- Date: Fri, 25 Apr 2025 10:58:44 GMT
- Title: SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology
- Authors: Elena Plekhanova, Damien Robert, Johannes Dollinger, Emilia Arens, Philipp Brun, Jan Dirk Wegner, Niklaus Zimmermann,
- Abstract summary: Self-supervised learning has enabled learning representations from unlabeled data.<n>These models are often trained on datasets biased toward areas of high human activity.<n>To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy.
- Score: 3.743127390843568
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the exacerbation of the biodiversity and climate crises, macroecological pursuits such as global biodiversity mapping become more urgent. Remote sensing offers a wealth of Earth observation data for ecological studies, but the scarcity of labeled datasets remains a major challenge. Recently, self-supervised learning has enabled learning representations from unlabeled data, triggering the development of pretrained geospatial models with generalizable features. However, these models are often trained on datasets biased toward areas of high human activity, leaving entire ecological regions underrepresented. Additionally, while some datasets attempt to address seasonality through multi-date imagery, they typically follow calendar seasons rather than local phenological cycles. To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy and introduce corresponding SSL4Eco, a multi-date Sentinel-2 dataset, on which we train an existing model with a season-contrastive objective. We compare representations learned from SSL4Eco against other datasets on diverse ecological downstream tasks and demonstrate that our straightforward sampling method consistently improves representation quality, highlighting the importance of dataset construction. The model pretrained on SSL4Eco reaches state of the art performance on 7 out of 8 downstream tasks spanning (multi-label) classification and regression. We release our code, data, and model weights to support macroecological and computer vision research at https://github.com/PlekhanovaElena/ssl4eco.
Related papers
- Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation [67.23953699167274]
Self-supervised learning (SSL) has enabled the development of vision foundation models for Earth Observation (EO)
In EO, this challenge is amplified by the redundancy and heavy-tailed distributions common in satellite imagery.
We propose a dynamic dataset pruning strategy designed to improve SSL pre-training by maximizing dataset diversity and balance.
arXiv Detail & Related papers (2025-04-09T15:13:26Z) - Image-Based Relocalization and Alignment for Long-Term Monitoring of Dynamic Underwater Environments [57.59857784298534]
We propose an integrated pipeline that combines Visual Place Recognition (VPR), feature matching, and image segmentation on video-derived images.<n>This method enables robust identification of revisited areas, estimation of rigid transformations, and downstream analysis of ecosystem changes.
arXiv Detail & Related papers (2025-03-06T05:13:19Z) - Trajectory World Models for Heterogeneous Environments [67.27233466954814]
Heterogeneity in sensors and actuators across environments poses a significant challenge to building large-scale pre-trained world models.<n>We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity.<n>We propose TrajWorld, a novel architecture capable of flexibly handling varying sensor and actuator information and capturing environment dynamics in-context.
arXiv Detail & Related papers (2025-02-03T13:59:08Z) - EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision [72.84868704100595]
This paper presents a dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks.<n>The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic.<n>Accompanying the dataset is EarthMAE, a tailored Masked Autoencoder developed to tackle the distinct challenges of remote sensing data.
arXiv Detail & Related papers (2025-01-14T13:42:22Z) - MiTREE: Multi-input Transformer Ecoregion Encoder for Species Distribution Modelling [2.3776390335270694]
We introduce MiTREE, a multi-input Vision-Transformer-based model with an ecoregion encoder.<n>We evaluate our model on the SatBird Summer and Winter datasets, the goal of which is to predict bird species encounter rates.
arXiv Detail & Related papers (2024-12-25T22:20:47Z) - DivShift: Exploring Domain-Specific Distribution Shifts in Large-Scale, Volunteer-Collected Biodiversity Datasets [0.0]
Large-scale, volunteer-collected datasets of community-identified natural world imagery like iNaturalist have enabled marked performance gains for fine-grained visual classification of species using machine learning methods.<n>Here we introduce Diversity Shift, a framework for quantifying the effects of domain-specific distribution shifts on machine learning model performance.<n>To diagnose the performance effects of biases specific to volunteer-collected biodiversity data, we also introduce DivShift - North American West Coast (DivShift-NAWC), a curated dataset of almost 7.5 million iNaturalist images across the western coast of North America partitioned across five types of expert-verified bias.
arXiv Detail & Related papers (2024-10-17T23:56:30Z) - Combining Observational Data and Language for Species Range Estimation [63.65684199946094]
We propose a novel approach combining millions of citizen science species observations with textual descriptions from Wikipedia.<n>Our framework maps locations, species, and text descriptions into a common space, enabling zero-shot range estimation from textual descriptions.<n>Our approach also acts as a strong prior when combined with observational data, resulting in more accurate range estimation with less data.
arXiv Detail & Related papers (2024-10-14T17:22:55Z) - Enhancing Ecological Monitoring with Multi-Objective Optimization: A Novel Dataset and Methodology for Segmentation Algorithms [17.802456388479616]
We introduce a unique semantic segmentation dataset of 6,096 high-resolution aerial images capturing indigenous and invasive grass species in Bega Valley, New South Wales, Australia.
This dataset presents a challenging task due to the overlap and distribution of grass species.
The dataset and code will be made publicly available, aiming to drive research in computer vision, machine learning, and ecological studies.
arXiv Detail & Related papers (2024-07-25T18:27:27Z) - EcoVerse: An Annotated Twitter Dataset for Eco-Relevance Classification, Environmental Impact Analysis, and Stance Detection [0.0]
EcoVerse is an annotated English Twitter dataset of 3,023 tweets spanning a wide spectrum of environmental topics.
We propose a three-level annotation scheme designed for Eco-Relevance Classification, Stance Detection, and introducing an original approach for Environmental Impact Analysis.
arXiv Detail & Related papers (2024-04-08T01:21:11Z) - SatBird: Bird Species Distribution Modeling with Remote Sensing and
Citizen Science Data [68.2366021016172]
We present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird.
We also provide a dataset in Kenya representing low-data regimes.
We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks.
arXiv Detail & Related papers (2023-11-02T02:00:27Z) - Spatial Implicit Neural Representations for Global-Scale Species Mapping [72.92028508757281]
Given a set of locations where a species has been observed, the goal is to build a model to predict whether the species is present or absent at any location.
Traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets.
We use Spatial Implicit Neural Representations (SINRs) to jointly estimate the geographical range of 47k species simultaneously.
arXiv Detail & Related papers (2023-06-05T03:36:01Z) - Ensembles of Vision Transformers as a New Paradigm for Automated
Classification in Ecology [0.0]
We show that ensembles of Data-efficient image Transformers (DeiTs) significantly outperform the previous state of the art (SOTA)
On all the data sets we test, we achieve a new SOTA, with a reduction of the error with respect to the previous SOTA ranging from 18.48% to 87.50%.
arXiv Detail & Related papers (2022-03-03T14:16:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.