SatHealth: A Multimodal Public Health Dataset with Satellite-based Environmental Factors
- URL: http://arxiv.org/abs/2506.13842v1
- Date: Mon, 16 Jun 2025 16:34:41 GMT
- Title: SatHealth: A Multimodal Public Health Dataset with Satellite-based Environmental Factors
- Authors: Yuanlong Wang, Pengqi Wang, Changchang Yin, Ping Zhang
- Abstract summary: SatHealth is a dataset combining environmental data, satellite images, all-disease prevalences estimated from medical claims, and social determinants of health indicators. We conducted experiments under two use cases with SatHealth: regional public health modeling and personal disease risk prediction. Results show that living environmental information can significantly improve AI models' performance and temporal-spatial generalizability.
- Score: 8.107949747896107
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Living environments play a vital role in the prevalence and progression of diseases, and understanding their impact on patients' health status has become increasingly crucial for developing AI models. However, due to the lack of long-term and fine-grained spatial and temporal data in public and population health studies, most existing studies fail to incorporate environmental data, limiting the models' performance and real-world application. To address this shortage, we developed SatHealth, a novel dataset combining multimodal spatiotemporal data, including environmental data, satellite images, all-disease prevalences estimated from medical claims, and social determinants of health (SDoH) indicators. We conducted experiments under two use cases with SatHealth: regional public health modeling and personal disease risk prediction. Experimental results show that living environmental information can significantly improve AI models' performance and temporal-spatial generalizability on various tasks. Finally, we deploy a web-based application to provide an exploration tool for SatHealth and one-click access to both our data and regional environmental embeddings to facilitate plug-and-play utilization. SatHealth is currently published with data for Ohio, and we will keep updating it to cover other parts of the US. With the web application and published code pipeline, our work provides valuable angles and resources for including environmental data in healthcare research and establishes a foundational framework for future research in environmental health informatics.
Related papers
- An AI-driven framework for the prediction of personalised health response to air pollution [30.858937705130106]
Air pollution poses a significant threat to public health, causing or exacerbating many respiratory and cardiovascular diseases. Recent advances in personal sensing have transformed the collection of behavioural and physiological data. We present a novel workflow for predicting personalised health responses to pollution by integrating physiological data from wearable fitness devices with real-time environmental exposures.
arXiv Detail & Related papers (2025-05-15T17:59:07Z)
- How Your Location Relates to Health: Variable Importance and Interpretable Machine Learning for Environmental and Sociodemographic Data [15.463748602675695]
Health outcomes depend on complex environmental and sociodemographic factors whose effects change over location and time. We use fine-grained spatial and temporal data to study these effects, namely the MEDSAT dataset of English health, environmental, and sociodemographic information. We then develop an interpretable machine learning framework based on Generalized Additive Models (GAMs) and Multiscale Geographically Weighted Regression (MGWR). Our findings identify NO2 as a global predictor for asthma, hypertension, and anxiety, alongside other outcome-specific predictors related to occupation, marriage, and vegetation.
arXiv Detail & Related papers (2025-01-03T21:34:35Z)
- DAMMI:Daily Activities in a Psychologically Annotated Multi-Modal IoT dataset [10.771838327042609]
The DAMMI dataset is designed to support researchers in the field.
It includes daily activity data of an elderly individual collected via home-installed sensors, smartphone data, and a wristband over 146 days.
The data collection spans significant events such as the COVID-19 pandemic, New Year's holidays, and the religious month of Ramadan.
arXiv Detail & Related papers (2024-10-05T13:26:54Z)
- NFDI4Health workflow and service for synthetic data generation, assessment and risk management [0.0]
A promising solution to this challenge is synthetic data generation.
This technique creates entirely new datasets that mimic the statistical properties of real data.
In this paper, we present the workflow and different services developed in the context of Germany's National Data Infrastructure project NFDI4Health.
arXiv Detail & Related papers (2024-08-08T14:08:39Z)
- A primer on synthetic health data [0.2770822269241974]
Recent advances in deep generative models have greatly expanded the potential to create realistic synthetic health datasets.
These synthetic datasets aim to preserve the characteristics, patterns, and overall scientific conclusions without disclosing patient identity or sensitive information.
However, many questions and challenges remain, including how to consistently evaluate a synthetic dataset's similarity and predictive utility.
arXiv Detail & Related papers (2024-01-31T08:13:35Z)
- SatBird: Bird Species Distribution Modeling with Remote Sensing and Citizen Science Data [68.2366021016172]
We present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird.
We also provide a dataset in Kenya representing low-data regimes.
We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks.
arXiv Detail & Related papers (2023-11-02T02:00:27Z)
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- Data-Centric Epidemic Forecasting: A Survey [56.99209141838794]
This survey delves into various data-driven methodological and practical advancements.
We enumerate the large number of epidemiological datasets and novel data streams that are relevant to epidemic forecasting.
We also discuss experiences and challenges that arise in real-world deployment of these forecasting systems.
arXiv Detail & Related papers (2022-07-19T16:15:11Z)
- Label scarcity in biomedicine: Data-rich latent factor discovery enhances phenotype prediction [102.23901690661916]
Low-dimensional embedding spaces can be derived from the UK Biobank population dataset to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics.
Performance gains from semi-supervised approaches will probably become an important ingredient for various medical data science applications.
arXiv Detail & Related papers (2021-10-12T16:25:50Z)
- Health Status Prediction with Local-Global Heterogeneous Behavior Graph [69.99431339130105]
Estimation of health status can be achieved with various kinds of data streams continuously collected from wearable sensors.
We propose to model the behavior-related multi-source data streams with a local-global graph.
We conduct experiments on the StudentLife dataset, and extensive results demonstrate the effectiveness of our proposed model.
arXiv Detail & Related papers (2021-03-23T11:10:04Z)
- Challenges in biomarker discovery and biorepository for Gulf-war-disease studies: a novel data platform solution [48.7576911714538]
We introduce a novel data platform, named ROSALIND, to overcome the challenges, foster healthy and vital collaborations and advance scientific inquiries.
We follow the principles etched in the platform name - ROSALIND stands for resource organisms with self-governed accessibility, linkability, integrability, neutrality, and dependability.
The deployment of ROSALIND in our GWI study over the past 12 months has accelerated the pace of data experiments and analysis, removed numerous error sources, and increased research quality and productivity.
arXiv Detail & Related papers (2021-02-04T20:38:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.