GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction
- URL: http://arxiv.org/abs/2507.06806v2
- Date: Tue, 21 Oct 2025 08:59:20 GMT
- Title: GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction
- Authors: Eya Cherif, Arthur Ouaknine, Luke A. Brown, Phuong D. Dao, Kyle R. Kovach, Bing Lu, Daniel Mederer, Hannes Feilhauer, Teja Kattenborn, David Rolnick
- Abstract summary: We present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction.
- Score: 13.321623196078276
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (e.g., across sensors and ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.
Related papers
- COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design [9.278432103577925]
Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products.
We introduce COP-GEN, a latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions.
Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities.
arXiv Detail & Related papers (2026-03-03T18:31:46Z) - A continental-scale dataset of ground beetles with high-resolution images and validated morphological trait measurements [13.860603856120795]
Ground beetles serve as critical bioindicators of ecosystem health.
The National Ecological Observatory Network (NEON) maintains an extensive collection of carabid specimens from across the U.S.
We present a dataset digitizing over 13,200 NEON carabids from 30 sites spanning the continental US and Hawaii through high-resolution imaging.
The dataset includes digitally measured elytra length and width of each specimen, establishing a foundation for automated trait extraction using AI.
arXiv Detail & Related papers (2026-01-14T18:44:54Z) - VFMF: World Modeling by Forecasting Vision Foundation Model Features [67.09340259579761]
We introduce a generative forecaster that performs autoregressive flow matching in the feature space of vision foundation models.
We show that this latent preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications.
With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities.
arXiv Detail & Related papers (2025-12-12T02:10:05Z) - FrogDeepSDM: Improving Frog Counting and Occurrence Prediction Using Multimodal Data and Pseudo-Absence Imputation [0.9537146822132906]
Species Distribution Modelling (SDM) helps predict species presence across large regions.
In this study, we enhance SDM accuracy for frogs (Anura) by applying deep learning and data imputation techniques.
Experiments show that data balancing significantly improved model performance, reducing the Mean Absolute Error (MAE) from 189 to 29 in frog counting tasks.
arXiv Detail & Related papers (2025-10-22T07:09:36Z) - Temporal-Spectral-Spatial Unified Remote Sensing Dense Prediction [62.376936772702905]
Current deep learning architectures for remote sensing are fundamentally rigid.
We introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling.
STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands.
It unifies disparate dense prediction tasks within a single architecture by conditioning the model on trainable task embeddings.
arXiv Detail & Related papers (2025-05-18T07:39:17Z) - SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology [3.743127390843568]
Self-supervised learning has enabled learning representations from unlabeled data.
These models are often trained on datasets biased toward areas of high human activity.
To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy.
arXiv Detail & Related papers (2025-04-25T10:58:44Z) - SatelliteCalculator: A Multi-Task Vision Foundation Model for Quantitative Remote Sensing Inversion [4.824120664293887]
We introduce SatelliteCalculator, the first vision foundation model for quantitative remote sensing inversion.
By leveraging physically defined index adapters, we automatically construct a large-scale dataset of over one million paired samples.
Experiments demonstrate that SatelliteCalculator achieves competitive accuracy across all tasks while significantly reducing inference cost.
arXiv Detail & Related papers (2025-04-18T03:48:04Z) - Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification [49.09505771145326]
We propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels.
Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
arXiv Detail & Related papers (2024-04-26T06:00:27Z) - LITE: Modeling Environmental Ecosystems with Multimodal Large Language Models [25.047123247476016]
LITE is a large language model for environmental ecosystems modeling.
It unifies different environmental variables by transforming them into natural language descriptions and line graph images.
During this step, the incomplete features are imputed by a sparse Mixture-of-Experts framework.
arXiv Detail & Related papers (2024-04-01T15:14:07Z) - Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z) - SSL-SoilNet: A Hybrid Transformer-based Framework with Self-Supervised Learning for Large-scale Soil Organic Carbon Prediction [2.554658234030785]
This study introduces a novel approach that aims to learn the geographical link between multimodal features via self-supervised contrastive learning.
The proposed approach has undergone rigorous testing on two distinct large-scale datasets.
arXiv Detail & Related papers (2023-08-07T13:44:44Z) - Variational Classification [51.2541371924591]
We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders.
Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency.
We induce a chosen latent distribution, instead of the implicit assumption found in a standard softmax layer.
arXiv Detail & Related papers (2023-05-17T17:47:19Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - Generative models-based data labeling for deep networks regression: application to seed maturity estimation from UAV multispectral images [3.6868861317674524]
Monitoring seed maturity is an increasing challenge in agriculture due to climate change and more restrictive practices.
Traditional methods are based on limited sampling in the field and analysis in the laboratory.
We propose a method for estimating parsley seed maturity using multispectral UAV imagery, with a new approach for automatic data labeling.
arXiv Detail & Related papers (2022-08-09T09:06:51Z) - Uncertainty Inspired RGB-D Saliency Detection [70.50583438784571]
We propose the first framework to employ uncertainty for RGB-D saliency detection by learning from the data labeling process.
Inspired by the saliency data labeling process, we propose a generative architecture to achieve probabilistic RGB-D saliency detection.
Results on six challenging RGB-D benchmark datasets show our approach's superior performance in learning the distribution of saliency maps.
arXiv Detail & Related papers (2020-09-07T13:01:45Z) - Semi-Automatic Data Annotation guided by Feature Space Projection [117.9296191012968]
We present a semi-automatic data annotation approach based on suitable feature space projection and semi-supervised label estimation.
We validate our method on the popular MNIST dataset and on images of human intestinal parasites with and without fecal impurities.
Our results demonstrate the added value of visual analytics tools that combine complementary abilities of humans and machines for more effective machine learning.
arXiv Detail & Related papers (2020-07-27T17:03:50Z)