Performance of models for monitoring sustainable development goals from remote sensing: A three-level meta-regression
- URL: http://arxiv.org/abs/2601.06178v1
- Date: Wed, 07 Jan 2026 15:16:26 GMT
- Title: Performance of models for monitoring sustainable development goals from remote sensing: A three-level meta-regression
- Authors: Jonas Klingwort, Nina M. Leach, Joep Burger,
- Abstract summary: Machine learning (ML) is a tool to exploit remote sensing data for the monitoring and implementation of the United Nations' Sustainable Development Goals.<n>In this paper, we report on a meta-analysis to evaluate the performance of ML applied to remote sensing data to monitor SDGs.<n>Overall accuracy was the most reported performance metric. It was analyzed using double arcsine transformation and a three-level random effects model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) is a tool to exploit remote sensing data for the monitoring and implementation of the United Nations' Sustainable Development Goals (SDGs). In this paper, we report on a meta-analysis to evaluate the performance of ML applied to remote sensing data to monitor SDGs. Specifically, we aim to 1) estimate the average performance; 2) determine the degree of heterogeneity between and within studies; and 3) assess how study features influence model performance. Using PRISMA guidelines, a search was performed across multiple academic databases to identify potentially relevant studies. A random sample of 200 was screened by three reviewers, resulting in 86 trials within 20 studies with 14 study features. Overall accuracy was the most reported performance metric. It was analyzed using double arcsine transformation and a three-level random effects model. The average overall accuracy of the best model was 0.90 [0.86, 0.92]. There was considerable heterogeneity in model performance, 64% of which was between studies. The only significant feature was the prevalence of the majority class, which explained 61% of the between-study heterogeneity. None of the other thirteen features added value to the model. The most important contributions of this paper are the following two insights. 1) Overall accuracy is the most popular performance metric, yet arguably the least insightful. Its sensitivity to class imbalance makes it necessary to normalize it, which is far from common practice. 2) The field needs to standardize the reporting. Reporting of the confusion matrix for independent test sets is the most important ingredient for between-study comparisons of ML classifiers. These findings underscore the need for robust and comparable evaluation metrics in machine learning applications to ensure reliable and actionable insights for effective SDG monitoring and policy formulation.
Related papers
- AI Generated Text Detection [0.0]
This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures.<n>We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage.<n>Results indicate that contextual modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization.
arXiv Detail & Related papers (2026-01-07T11:18:10Z) - One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning [54.580646706013965]
Reward models (RMs) play a critical role in aligning large language models with human preferences.<n>We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios.<n>To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
arXiv Detail & Related papers (2025-10-30T06:08:27Z) - A meta-analysis on the performance of machine-learning based language models for sentiment analysis [0.5243460995467893]
The study aims to estimate the average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance.<n>Overall accuracy is widely used but often misleading due to its sensitivity to class imbalance and the number of sentiment classes, highlighting the need for normalization.
arXiv Detail & Related papers (2025-09-10T10:05:32Z) - Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models.<n>We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients.<n>We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.<n>Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.<n>We propose Reasoning-Driven Process Reward Modeling (R-PRM)<n>R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks [2.7729041396205014]
This study evaluates the classification performance of five open-source large language models (LLMs)<n>We report precision, recall, and F1 scores with 95% confidence intervals for all model-task combinations.
arXiv Detail & Related papers (2025-03-19T12:51:52Z) - Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources [13.750202656564907]
Adverse event (AE) extraction is crucial for monitoring and analyzing the safety profiles of immunizations.
This study aims to evaluate the effectiveness of large language models (LLMs) and traditional deep learning models in AE extraction.
arXiv Detail & Related papers (2024-06-26T03:56:21Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z) - The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease
detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of challenging problem in healthcare.
Within this framework, we train predictive 15 models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z) - Are Sample-Efficient NLP Models More Robust? [90.54786862811183]
We investigate the relationship between sample efficiency (amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation)
We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
arXiv Detail & Related papers (2022-10-12T17:54:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.