Computationally Assisted Quality Control for Public Health Data Streams
- URL: http://arxiv.org/abs/2306.16914v2
- Date: Tue, 2 Jan 2024 23:09:07 GMT
- Title: Computationally Assisted Quality Control for Public Health Data Streams
- Authors: Ananya Joshi, Kathryn Mazaitis, Roni Rosenfeld, Bryan Wilder
- Abstract summary: FlaSH is a practical outlier detection framework for public health data users.
It uses simple, scalable models to capture statistical properties of public health streams.
It has been deployed on data streams used by public health stakeholders.
- Score: 21.056027241048152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Irregularities in public health data streams (like COVID-19 Cases) hamper
data-driven decision-making for public health stakeholders. A real-time,
computer-generated list of the most important, outlying data points from
thousands of daily-updated public health data streams could assist an expert
reviewer in identifying these irregularities. However, existing outlier
detection frameworks perform poorly on this task because they do not account
for the data volume or for the statistical properties of public health streams.
Accordingly, we developed FlaSH (Flagging Streams in public Health), a
practical outlier detection framework for public health data users that uses
simple, scalable models to capture these statistical properties explicitly. In
an experiment where human experts evaluate FlaSH and existing methods
(including deep learning approaches), FlaSH scales to the data volume of this
task, matches or exceeds these other methods in mean accuracy, and identifies
the outlier points that users empirically rate as more helpful. Based on these
results, FlaSH has been deployed on data streams used by public health
stakeholders.
Related papers
- An Analytical Approach to Privacy and Performance Trade-Offs in Healthcare Data Sharing [1.2179548969182572]
Older adults, frequently hospitalized patients, and racial minorities are vulnerable to privacy attacks.
We evaluate three anonymization methods ($k$-anonymity, the technique by Zheng et al., and the MO-OBAM model) based on their ability to reduce re-identification risk.
arXiv Detail & Related papers (2025-08-25T21:36:47Z)
- A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage [77.83757117924995]
We propose a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release.
Our approach shows that seemingly innocuous auxiliary information can be used to infer sensitive attributes like age or substance use history from sanitized data.
arXiv Detail & Related papers (2025-04-28T01:16:27Z)
- Outlier Ranking in Large-Scale Public Health Streams [17.53470381091954]
Disease control experts inspect public health data streams daily for outliers worth investigating.
We propose a new task for algorithms to rank the outputs of any univariate method applied to each of many streams.
Our novel algorithm for this task, which leverages hierarchical networks and extreme value analysis, performed the best across traditional outlier detection metrics.
arXiv Detail & Related papers (2024-01-02T23:08:49Z)
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- CenTime: Event-Conditional Modelling of Censoring in Survival Analysis [49.44664144472712]
We introduce CenTime, a novel approach to survival analysis that directly estimates the time to event.
Our method features an innovative event-conditional censoring mechanism that performs robustly even when uncensored data is scarce.
Our results indicate that CenTime offers state-of-the-art performance in predicting time-to-death while maintaining comparable ranking performance.
arXiv Detail & Related papers (2023-09-07T17:07:33Z)
- On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection [55.73320979733527]
We propose a data-agnostic adversarial detection framework, which induces different responses between normal and adversarial samples to UAPs.
Experimental results show that our method achieves competitive detection performance on various text classification tasks.
arXiv Detail & Related papers (2023-06-27T02:54:07Z)
- Sensitive Data Detection with High-Throughput Machine Learning Models in Electrical Health Records [15.982220037507169]
The Health Insurance Portability and Accountability Act of 1996 (HIPAA) is a federal law designed to protect sensitive health information (PHI).
One of the challenges in this area of research is the heterogeneous nature of PHI fields in data across different parties.
This variability makes rule-based sensitive variable identification systems that work on one database fail on another.
arXiv Detail & Related papers (2023-04-30T16:14:23Z)
- Practical Challenges in Differentially-Private Federated Survival Analysis of Medical Data [57.19441629270029]
In this paper, we take advantage of the inherent properties of neural networks to federate the training of survival analysis models.
In the realistic setting of small medical datasets and only a few data centers, the noise added for differential privacy makes it harder for the models to converge.
We propose DPFed-post which adds a post-processing stage to the private federated learning scheme.
arXiv Detail & Related papers (2022-02-08T10:03:24Z)
- Reliable and Trustworthy Machine Learning for Health Using Dataset Shift Detection [7.263558963357268]
Unpredictable ML model behavior on unseen data, especially in the health domain, raises serious concerns about its safety.
We show that Mahalanobis distance- and Gram matrices-based out-of-distribution detection methods are able to detect out-of-distribution data with high accuracy.
We then translate the out-of-distribution score into a human-interpretable confidence score to investigate its effect on users' interaction with health ML applications.
arXiv Detail & Related papers (2021-10-26T20:49:01Z)
- Health Status Prediction with Local-Global Heterogeneous Behavior Graph [69.99431339130105]
Estimation of health status can be achieved with various kinds of data streams continuously collected from wearable sensors.
We propose to model the behavior-related multi-source data streams with a local-global graph.
We conduct experiments on the StudentLife dataset, and extensive results demonstrate the effectiveness of our proposed model.
arXiv Detail & Related papers (2021-03-23T11:10:04Z)
- Measuring Data Collection Diligence for Community Healthcare [23.612133021992868]
Non-diligent data collection by community health workers (CHWs) is a significant challenge in developing countries.
In this work, we define and test a data collection diligence score.
Our framework has been validated on the ground using observations by the field monitors of our partner NGO in India.
arXiv Detail & Related papers (2020-11-05T16:45:03Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, outperforming the best state-of-the-art baseline by up to 19%.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.