Methodological Precedence in Health Tech: Why ML/Big Data Analysis Must Follow Basic Epidemiological Consistency. A Case Study
- URL: http://arxiv.org/abs/2511.07500v1
- Date: Wed, 12 Nov 2025 01:01:51 GMT
- Title: Methodological Precedence in Health Tech: Why ML/Big Data Analysis Must Follow Basic Epidemiological Consistency. A Case Study
- Authors: Marco Roccetti,
- Abstract summary: This study highlights a crucial cautionary principle: sophisticated analyses amplify, rather than correct, severe methodological flaws.<n>We demonstrate that the observed effects are mathematical artifacts stemming from an uncorrected selection bias in the cohort construction.<n>This analysis serves as a robust reminder that even the most complex health studies must first pass the test of basic epidemiological consistency.
- Score: 0.33842793760651557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of advanced analytical tools, including Machine Learning (ML) and massive data processing, has revolutionized health research, promising unprecedented accuracy in diagnosis and risk prediction. However, the rigor of these complex methods is fundamentally dependent on the quality and integrity of the underlying datasets and the validity of their statistical design. We propose an emblematic case where advanced analysis (ML/Big Data) must necessarily be subsequent to the verification of basic methodological coherence. This study highlights a crucial cautionary principle: sophisticated analyses amplify, rather than correct, severe methodological flaws rooted in basic design choices, leading to misleading or contradictory findings. By applying simple, standard descriptive statistical methods and established national epidemiological benchmarks to a recently published cohort study on vaccine outcomes and psychiatric events, we expose multiple, statistically irreconcilable paradoxes. These paradoxes, including an implausible risk reduction for a chronic disorder in a high-risk group and contradictory incidence rate comparisons, definitively invalidate the reported hazard ratios (HRs). We demonstrate that the observed effects are mathematical artifacts stemming from an uncorrected selection bias in the cohort construction. This analysis serves as a robust reminder that even the most complex health studies must first pass the test of basic epidemiological consistency before any conclusion drawn from subsequent advanced ML or statistical modeling can be considered valid or publishable. We conclude that robust methods, such as Propensity Score Matching, are essential for achieving valid causal inference from administrative data in the absence of randomization
Related papers
- Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency [52.50039435394964]
We systematically evaluate foundation models for regression-based tasks.<n>We extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models.<n>Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts.
arXiv Detail & Related papers (2026-01-29T14:06:50Z) - UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles [42.37986459997699]
We propose an approach for decomposing uncertainty in SHAP values into aleatoric, epistemic, and entanglement components.<n>We validate the method across three real-world use cases with descriptive statistical analyses.<n>This understanding can guide the development of robust decision-making processes and the refinement of models in high-stakes applications.
arXiv Detail & Related papers (2025-08-13T09:20:33Z) - Data Fusion for Partial Identification of Causal Effects [62.56890808004615]
We propose a novel partial identification framework that enables researchers to answer key questions.<n>Is the causal effect positive or negative? and How severe must assumption violations be to overturn this conclusion?<n>We apply our framework to the Project STAR study, which investigates the effect of classroom size on students' third-grade standardized test performance.
arXiv Detail & Related papers (2025-05-30T07:13:01Z) - AI Alignment in Medical Imaging: Unveiling Hidden Biases Through Counterfactual Analysis [16.21270312974956]
We introduce a novel statistical framework to evaluate the dependency of medical imaging ML models on sensitive attributes, such as demographics.<n>We present a practical algorithm that combines conditional latent diffusion models with statistical hypothesis testing to identify and quantify such biases.
arXiv Detail & Related papers (2025-04-28T09:28:25Z) - General targeted machine learning for modern causal mediation analysis [3.813608775141218]
Causal mediation analyses investigate the mechanisms through which causes exert their effects.<n>We show that the identification formulas for six popular non-parametric approaches to mediation analysis can be recovered from just two statistical estimands.<n>We propose an all-purpose one-step estimation algorithm that can be coupled with machine learning in any mediation study.
arXiv Detail & Related papers (2024-08-26T20:31:26Z) - SepsisLab: Early Sepsis Prediction with Uncertainty Quantification and Active Sensing [67.8991481023825]
Sepsis is the leading cause of in-hospital mortality in the USA.
Existing predictive models are usually trained on high-quality data with few missing information.
For the potential high-risk patients with low confidence due to limited observations, we propose a robust active sensing algorithm.
arXiv Detail & Related papers (2024-07-24T04:47:36Z) - Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic
Health Records Data [7.422597776308963]
We propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features.
In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses.
arXiv Detail & Related papers (2022-06-18T03:52:43Z) - Differential privacy and robust statistics in high dimensions [49.50869296871643]
High-dimensional Propose-Test-Release (HPTR) builds upon three crucial components: the exponential mechanism, robust statistics, and the Propose-Test-Release mechanism.
We show that HPTR nearly achieves the optimal sample complexity under several scenarios studied in the literature.
arXiv Detail & Related papers (2021-11-12T06:36:40Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD)
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to $19%$ over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z) - Predictive Modeling of ICU Healthcare-Associated Infections from
Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling
Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units.
The aim is to support decision making addressed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z) - Statistical Agnostic Mapping: a Framework in Neuroimaging based on
Concentration Inequalities [0.0]
We derive a Statistical Agnostic (non-parametric) Mapping at voxel or multi-voxel level.
We propose a novel framework in neuroimaging based on concentration inequalities.
arXiv Detail & Related papers (2019-12-27T18:27:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.