Evaluating the Role of Training Data Origin for Country-Scale Cropland Mapping in Data-Scarce Regions: A Case Study of Nigeria
- URL: http://arxiv.org/abs/2312.10872v2
- Date: Sun, 13 Jul 2025 08:05:25 GMT
- Title: Evaluating the Role of Training Data Origin for Country-Scale Cropland Mapping in Data-Scarce Regions: A Case Study of Nigeria
- Authors: Joaquin Gajardo, Michele Volpi, Daniel Onwude, Thijs Defraeye
- Abstract summary: A key challenge is understanding how the quantity, quality, and proximity of the training data to the target region influence model performance. We evaluate this in Nigeria, using 1,827 manually labelled samples covering the whole country, and subsets of the Geowiki dataset. Results show local data significantly boosts performance, with accuracy gains up to 0.246 (RF) and 0.178 (LSTM).
- Score: 0.6249768559720122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cropland maps are essential for remote sensing-based agricultural monitoring, providing timely insights without extensive field surveys. Machine learning enables large-scale mapping but depends on geo-referenced ground-truth data, which is costly to collect, motivating the use of global datasets in data-scarce regions. A key challenge is understanding how the quantity, quality, and proximity of the training data to the target region influence model performance. We evaluate this in Nigeria, using 1,827 manually labelled samples covering the whole country, and subsets of the Geowiki dataset: Nigeria-only, regional (Nigeria and neighbouring countries), and global. We extract pixel-wise multi-source time series arrays from Sentinel-1, Sentinel-2, ERA5 climate, and a digital elevation model using Google Earth Engine, comparing Random Forests with LSTMs, including a lightweight multi-headed LSTM variant. Results show local data significantly boosts performance, with accuracy gains up to 0.246 (RF) and 0.178 (LSTM). Nigeria-only or regional data outperformed global data despite the lower amount of labels, with the exception of the multi-headed LSTM, which benefited from global data when local samples were absent. Sentinel-1, climate, and topographic data are critical data sources, with their removal reducing F1-score by up to 0.593. Addressing class imbalance also improved LSTM accuracy by up to 0.071. Our top-performing model (Nigeria-only LSTM) achieved an F1-score of 0.814 and accuracy of 0.842, matching the best global land cover product while offering stronger recall, critical for food security. We release code, data, maps, and an interactive web app to support future work.
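The abstract reports accuracy and F1-score side by side and singles out recall as critical. As a minimal sketch (not the authors' evaluation code), the snippet below shows how these metrics are computed for a binary cropland/non-cropland task and why accuracy alone can look good on imbalanced labels. The function name `binary_metrics` and the toy labels are hypothetical and chosen only for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = cropland)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cropland correctly found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-cropland correctly found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed cropland

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical, imbalanced toy labels: 3 cropland pixels out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# A classifier that predicts cropland only once misses two of the three.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this toy input, accuracy is 0.8 while recall is only one third and F1 is 0.5, which illustrates why a cropland map evaluated on accuracy alone can still miss most cultivated area.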
Related papers
- Metadata Conditioned Large Language Models for Localization [25.913929585741034]
We show that metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization. Our ablation studies demonstrate that URL-level metadata alone captures much of the geographic signal. After instruction tuning, metadata conditioned global models achieve accuracy comparable to LLaMA-3.2-1B-Instruct, despite being trained on substantially less data.
arXiv Detail & Related papers (2026-01-21T18:20:59Z) - Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales [61.03549470159347]
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
arXiv Detail & Related papers (2025-10-13T01:12:21Z) - WorldPM: Scaling Human Preference Modeling [130.23230492612214]
We propose World Preference Modeling (WorldPM) to emphasize this scaling potential. We collect preference data from public forums covering diverse user communities. We conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters.
arXiv Detail & Related papers (2025-05-15T17:38:37Z) - EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision [72.84868704100595]
This paper presents a dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks.
The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic.
Accompanying the dataset is EarthMAE, a tailored Masked Autoencoder developed to tackle the distinct challenges of remote sensing data.
arXiv Detail & Related papers (2025-01-14T13:42:22Z) - Local vs. Global: Local Land-Use and Land-Cover Models Deliver Higher Quality Maps [3.606726772030176]
In 2023, 58.0% of the African population experienced moderate to severe food insecurity, with 21.6% facing severe food insecurity.
We propose a data-centric framework with a teacher-student model setup, which uses diverse data sources to produce local land-cover maps.
Our framework achieved higher quality maps, with improvements of 0.14 in the F1 score and 0.21 in Intersection-over-Union, compared to the best global model.
arXiv Detail & Related papers (2024-12-01T11:48:58Z) - Contrasting local and global modeling with machine learning and satellite data: A case study estimating tree canopy height in African savannas [23.868986217962373]
Small models trained only with locally-collected data outperform published global TCH maps.
We identify specific points of conflict and synergy between local and global modeling paradigms.
arXiv Detail & Related papers (2024-11-21T17:53:27Z) - Fields of The World: A Machine Learning Benchmark Dataset For Global Agricultural Field Boundary Segmentation [12.039406240082515]
Fields of The World (FTW) is a novel benchmark dataset for agricultural field instance segmentation.
FTW is an order of magnitude larger than previous datasets with 70,462 samples.
We show that models trained on FTW have better zero-shot and fine-tuning performance in held-out countries.
arXiv Detail & Related papers (2024-09-24T17:20:58Z) - Regional biases in image geolocation estimation: a case study with the SenseCity Africa dataset [0.0]
We apply a state-of-the-art image geolocation estimation model (ISNs) to a crowd-sourced dataset of geolocated images from the African continent (SCA100).
Our findings show that the ISNs model tends to over-predict image locations in high-income countries of the Western world.
Our results suggest that using IM2GPS3k as a training set and benchmark for image geolocation estimation and other computer vision models overlooks its potential application in the African context.
arXiv Detail & Related papers (2024-04-03T08:27:24Z) - Recognize Any Regions [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model. Experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z) - GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z) - HarvestNet: A Dataset for Detecting Smallholder Farming Activity Using Harvest Piles and Remote Sensing [50.4506590177605]
HarvestNet is a dataset for mapping the presence of farms in the Ethiopian regions of Tigray and Amhara during 2020-2023.
We introduce a new approach based on the detection of harvest piles characteristic of many smallholder systems.
We conclude that remote sensing of harvest piles can contribute to more timely and accurate cropland assessments in food insecure regions.
arXiv Detail & Related papers (2023-08-23T11:03:28Z) - Exploring the Effectiveness of Dataset Synthesis: An application of Apple Detection in Orchards [68.95806641664713]
We explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection.
We train a YOLOv5m object detection model to predict apples in a real-world apple detection dataset.
Results demonstrate that the model trained on generated data slightly underperforms a baseline model trained on real-world images.
arXiv Detail & Related papers (2023-06-20T09:46:01Z) - The Canadian Cropland Dataset: A New Land Cover Dataset for Multitemporal Deep Learning Classification in Agriculture [0.8602553195689513]
The paper presents a temporal patch-based dataset of Canadian croplands enriched with labels retrieved from the Canadian Annual Crop Inventory.
The dataset contains 78,536 manually verified high-resolution spatial images from 10 crop classes collected over four crop production years.
As a benchmark, we provide models and source code that allow a user to predict the crop class using a single image (ResNet, DenseNet, EfficientNet) or a sequence of images (LRCN, 3D-CNN) from the same location.
arXiv Detail & Related papers (2023-05-31T18:40:15Z) - Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation for autonomous vehicles [63.20765930558542]
3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization.
We propose a new dataset, Navya 3D (Navya3DSeg), with a diverse label space corresponding to a large scale production grade operational domain.
It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds.
arXiv Detail & Related papers (2023-02-16T13:41:19Z) - DYNAFED: Tackling Client Data Heterogeneity with Global Dynamics [60.60173139258481]
Local training on non-iid distributed data results in a deflected local optimum.
A natural solution is to gather all client data onto the server, such that the server has a global view of the entire data distribution.
In this paper, we put forth an idea to collect and leverage global knowledge on the server without hindering data privacy.
arXiv Detail & Related papers (2022-11-20T06:13:06Z) - End-to-end deep learning for directly estimating grape yield from ground-based imagery [53.086864957064876]
This study demonstrates the application of proximal imaging combined with deep learning for yield estimation in vineyards.
Three model architectures were tested: object detection, CNN regression, and transformer models.
The study showed the applicability of proximal imaging and deep learning for prediction of grapevine yield on a large scale.
arXiv Detail & Related papers (2022-08-04T01:34:46Z) - Strict baselines for Covid-19 forecasting and ML perspective for USA and Russia [105.54048699217668]
Covid-19 allows researchers to gather datasets accumulated over 2 years and to use them in predictive analysis.
We present the results of a consistent comparative study of different types of methods for predicting the dynamics of the spread of Covid-19 based on regional data for two countries: the United States and Russia.
arXiv Detail & Related papers (2022-07-15T18:21:36Z) - Using Machine Learning to generate an open-access cropland map from satellite images time series in the Indian Himalayan Region [0.28675177318965034]
We develop an ML pipeline that relies on Sentinel-2 satellite images time series.
We generate a cropland map for three districts of Himachal Pradesh, spanning 14,600 km², which improves the resolution and quality of existing public maps.
arXiv Detail & Related papers (2022-03-28T12:08:06Z) - Jalisco's multiclass land cover analysis and classification using a novel lightweight convnet with real-world multispectral and relief data [51.715517570634994]
We present our novel lightweight (only 89k parameters) Convolutional Neural Network (ConvNet) for land cover (LC) classification and analysis.
In this work, we combine three real-world open data sources to obtain 13 channels.
Our embedded analysis anticipates the limited performance in some classes and gives us the opportunity to group the most similar.
arXiv Detail & Related papers (2022-01-26T14:58:51Z) - Continental-Scale Building Detection from High Resolution Satellite Imagery [5.56205296867374]
We study variations in architecture, loss functions, regularization, pre-training, self-training and post-processing that increase instance segmentation performance.
Experiments were carried out using a dataset of 100k satellite images across Africa containing 1.75M manually labelled building instances.
We report novel methods for improving performance of building detection with this type of model, including the use of mixup.
arXiv Detail & Related papers (2021-07-26T15:48:14Z) - Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
arXiv Detail & Related papers (2020-09-22T20:19:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.