Application and Validation of Geospatial Foundation Model Data for the Prediction of Health Facility Programmatic Outputs -- A Case Study in Malawi
- URL: http://arxiv.org/abs/2510.25954v1
- Date: Wed, 29 Oct 2025 20:53:07 GMT
- Title: Application and Validation of Geospatial Foundation Model Data for the Prediction of Health Facility Programmatic Outputs -- A Case Study in Malawi
- Authors: Lynn Metz, Rachel Haggard, Michael Moszczynski, Samer Asbah, Chris Mwase, Patricia Khomani, Tyler Smith, Hannah Cooper, Annie Mwale, Arbaaz Muslim, Gautam Prasad, Mimi Sun, Tomer Shekel, Joydeep Paul, Anna Carter, Shravya Shetty, Dylan Green,
- Abstract summary: Geospatial Foundation Models (GeoFMs) offer a promising avenue by synthesizing diverse spatial, temporal, and behavioral data.<n>This study evaluated the predictive performance of three GeoFM embedding sources for modeling 15 routine health programmatic outputs in Malawi.
- Score: 0.7669892939042836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The reliability of routine health data in low and middle-income countries (LMICs) is often constrained by reporting delays and incomplete coverage, necessitating the exploration of novel data sources and analytics. Geospatial Foundation Models (GeoFMs) offer a promising avenue by synthesizing diverse spatial, temporal, and behavioral data into mathematical embeddings that can be efficiently used for downstream prediction tasks. This study evaluated the predictive performance of three GeoFM embedding sources - Google Population Dynamics Foundation Model (PDFM), Google AlphaEarth (derived from satellite imagery), and mobile phone call detail records (CDR) - for modeling 15 routine health programmatic outputs in Malawi, and compared their utility to traditional geospatial interpolation methods. We used XGBoost models on data from 552 health catchment areas (January 2021-May 2023), assessing performance with R2, and using an 80/20 training and test data split with 5-fold cross-validation used in training. While predictive performance was mixed, the embedding-based approaches improved upon baseline geostatistical methods in 13 of 15 (87%) indicators tested. A Multi-GeoFM model integrating all three embedding sources produced the most robust predictions, achieving average 5-fold cross validated R2 values for indicators like population density (0.63), new HIV cases (0.57), and child vaccinations (0.47) and test set R2 of 0.64, 0.68, and 0.55, respectively. Prediction was poor for prediction targets with low primary data availability, such as TB and malnutrition cases. These results demonstrate that GeoFM embeddings imbue a modest predictive improvement for select health and demographic outcomes in an LMIC context. We conclude that the integration of multiple GeoFM sources is an efficient and valuable tool for supplementing and strengthening constrained routine health information systems.
Related papers
- Integrating Genomics into Multimodal EHR Foundation Models [56.31910745104141]
This paper introduces an innovative EHR foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality.<n>The framework aims to learn complex relationships between clinical data and genetic predispositions.<n>This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies.
arXiv Detail & Related papers (2025-10-24T15:56:40Z) - Validating Emergency Department Admission Predictions Based on Local Data Through MIMIC-IV [0.0]
This study validates hospital admission prediction models initially developed using a small local dataset from a Greek hospital by leveraging the comprehensive MIMIC-IV dataset.<n>After preprocessing the MIMIC-IV data, five algorithms were evaluated: Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Random Forest (RF), Recursive Partitioning and Regression Trees (RPART) and Support Vector Machines (SVM)<n>RF demonstrated superior performance, achieving an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.9999, sensitivity of 0.9997, and specificity of 0.9999 when applied to the MIMIC-
arXiv Detail & Related papers (2025-03-18T13:54:28Z) - Region-wise stacking ensembles for estimating brain-age using MRI [0.23301643766310373]
High-dimensional MRI data pose challenges to building generalizable and interpretable models.<n>We present a conceptually novel two-level stacking ensemble (SE) approach.<n>First-level predictions showed improved and more robust aging signal.
arXiv Detail & Related papers (2025-01-17T12:24:28Z) - CRTRE: Causal Rule Generation with Target Trial Emulation Framework [47.2836994469923]
We introduce a novel method called causal rule generation with target trial emulation framework (CRTRE)
CRTRE applies randomize trial design principles to estimate the causal effect of association rules.
We then incorporate such association rules for the downstream applications such as prediction of disease onsets.
arXiv Detail & Related papers (2024-11-10T02:40:06Z) - Deep Survival Analysis from Adult and Pediatric Electrocardiograms: A Multi-center Benchmark Study [5.554864650304149]
Artificial intelligence applied to electrocardiography (AI-ECG) shows potential for mortality prediction.<n>We evaluated model design choices across three large cohorts: Beth Israel Deaconess (MIMIC-IV), Telehealth Network of Minas Gerais (Code-15), and Boston Children's Hospital (BCH)<n>We evaluated models predicting all-cause mortality, comparing horizon-based classification and deep survival methods with neural architectures including convolutional networks and transformers.
arXiv Detail & Related papers (2024-06-24T14:37:17Z) - Out-of-distribution Reject Option Method for Dataset Shift Problem in Early Disease Onset Prediction [3.382273111067512]
This paper proposes the out-of-distribution reject option for prediction (ODROP) to diminish dataset shift in real-world settings.<n>We used two real-world health checkup datasets with dataset shift, across three disease onset prediction tasks: diabetes, dyslipidemia, and hypertension.<n>In five OOD approaches, variational autoencoder method demonstrated notably higher stability and a greater improvement in AUROC.
arXiv Detail & Related papers (2024-05-30T09:14:01Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - Long-term drought prediction using deep neural networks based on geospatial weather data [75.38539438000072]
High-quality drought forecasting up to a year in advance is critical for agriculture planning and insurance.
We tackle drought data by introducing an end-to-end approach that adopts a systematic end-to-end approach.
Key findings are the exceptional performance of a Transformer model, EarthFormer, in making accurate short-term (up to six months) forecasts.
arXiv Detail & Related papers (2023-09-12T13:28:06Z) - Strict baselines for Covid-19 forecasting and ML perspective for USA and
Russia [105.54048699217668]
Covid-19 allows researchers to gather datasets accumulated over 2 years and to use them in predictive analysis.
We present the results of a consistent comparative study of different types of methods for predicting the dynamics of the spread of Covid-19 based on regional data for two countries: the United States and Russia.
arXiv Detail & Related papers (2022-07-15T18:21:36Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD)
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to $19%$ over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.