Comparative Evaluation of Explainable Machine Learning Versus Linear Regression for Predicting County-Level Lung Cancer Mortality Rate in the United States
- URL: http://arxiv.org/abs/2512.17934v1
- Date: Wed, 10 Dec 2025 23:33:12 GMT
- Title: Comparative Evaluation of Explainable Machine Learning Versus Linear Regression for Predicting County-Level Lung Cancer Mortality Rate in the United States
- Authors: Soheil Hashtarkhani, Brianna M. White, Benyamin Hoseini, David L. Schwartz, Arash Shaban-Nejad
- Abstract summary: Lung cancer (LC) is a leading cause of cancer-related mortality in the United States. This study applied three models to predict county-level LC mortality rates across the United States.
- Score: 0.1957338076370071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lung cancer (LC) is a leading cause of cancer-related mortality in the United States. Accurate prediction of LC mortality rates is crucial for guiding targeted interventions and addressing health disparities. Although traditional regression-based models have been commonly used, explainable machine learning models may offer enhanced predictive accuracy and deeper insights into the factors influencing LC mortality. This study applied three models: random forest (RF), gradient boosting regression (GBR), and linear regression (LR) to predict county-level LC mortality rates across the United States. Model performance was evaluated using R-squared and root mean squared error (RMSE). Shapley Additive Explanations (SHAP) values were used to determine variable importance and their directional impact. Geographic disparities in LC mortality were analyzed through Getis-Ord (Gi*) hotspot analysis. The RF model outperformed both GBR and LR, achieving an R2 value of 41.9% and an RMSE of 12.8. SHAP analysis identified smoking rate as the most important predictor, followed by median home value and the percentage of the Hispanic ethnic population. Spatial analysis revealed significant clusters of elevated LC mortality in the mid-eastern counties of the United States. The RF model demonstrated superior predictive performance for LC mortality rates, emphasizing the critical roles of smoking prevalence, housing values, and the percentage of Hispanic ethnic population. These findings offer valuable actionable insights for designing targeted interventions, promoting screening, and addressing health disparities in regions most affected by LC in the United States.
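The abstract names R-squared and RMSE as its evaluation metrics and Getis-Ord Gi* as its hotspot statistic. The following is a minimal sketch of how these quantities are commonly computed, not the authors' implementation. The function names, the toy mortality values, and the binary "self plus adjacent neighbors" weights are all assumptions made for illustration.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared error, in the units of the outcome."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def getis_ord_gi_star(values, neighbors):
    """Gi* z-score per location, assuming binary weights.

    neighbors[i] lists the indices in location i's neighborhood,
    including i itself (that inclusion is what makes it Gi*).
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for idx in neighbors:
        w_sum = len(idx)  # with 0/1 weights, sum(w) equals sum(w^2)
        local_sum = sum(values[j] for j in idx)
        denom = std * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append((local_sum - mean * w_sum) / denom)
    return scores


# Hypothetical observed and predicted rates for five counties.
observed = [40.0, 55.0, 62.0, 48.0, 70.0]
predicted = [45.0, 50.0, 60.0, 52.0, 63.0]
print("R^2:", round(r_squared(observed, predicted), 4))
print("RMSE:", round(rmse(observed, predicted), 4))

# Hypothetical rates for six counties arranged along a line.
rates = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
neighbors = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5]]
z_scores = getis_ord_gi_star(rates, neighbors)
print("Gi* z-scores:", [round(z, 3) for z in z_scores])
```

Large positive Gi* z-scores mark clusters of high values (hotspots) and large negative ones mark clusters of low values. A real analysis would derive the neighborhoods from county geometry rather than a hand-written list.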
Related papers
- Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that our approach significantly reduces the probability of historical hallucinations.
arXiv Detail & Related papers (2026-02-27T04:49:01Z)
- Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency [52.50039435394964]
We systematically evaluate foundation models for regression-based tasks. We extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models. Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts.
arXiv Detail & Related papers (2026-01-29T14:06:50Z)
- Statistical vs. Deep Learning Models for Estimating Substance Overdose Excess Mortality in the US [0.951591069547877]
Estimating excess mortality, defined as deaths beyond expected levels based on pre-pandemic patterns, is essential for understanding pandemic impacts and informing intervention strategies. We present a systematic comparison of SARIMA against three deep learning (DL) architectures (LSTM, Seq2Seq, and Transformer) for counterfactual mortality estimation. Our findings establish that carefully validated DL models can provide more reliable counterfactual estimates than traditional methods for public health planning.
arXiv Detail & Related papers (2025-12-25T00:49:59Z)
- Fairness Evaluation of Risk Estimation Models for Lung Cancer Screening [0.6974609493696966]
We evaluate potential performance disparities and fairness in two deep learning risk estimation models for lung cancer screening. Models were trained on data from the US-based National Lung Screening Trial (NLST). We observed a statistically significant AUROC difference in Sybil's performance between women (0.88, 95% CI: 0.86, 0.90) and men (0.81, 95% CI: 0.78, 0.84, p < .001). At 90% specificity, Venkadesh21 showed lower sensitivity for Black (0.39, 95% CI: 0.23, 0.59) than White participants (0.69, 95% CI: 0.65,
arXiv Detail & Related papers (2025-12-23T19:57:21Z)
- Case Prompting to Mitigate Large Language Model Bias for ICU Mortality Prediction [17.91443453604627]
Large language models (LLMs) show promise in predicting outcomes from structured medical data. LLMs may exhibit demographic biases related to sex, age, and race, limiting their trustworthy use in clinical practice. We propose a training-free, clinically adaptive prompting framework to simultaneously improve fairness and performance.
arXiv Detail & Related papers (2025-12-17T12:29:53Z)
- Methodology for Comparing Machine Learning Algorithms for Survival Analysis [55.65997641180011]
Six machine learning models for survival analysis were evaluated. XGB-AFT achieved the best performance (C-Index = 0.7618; IPCW = 0.7532), followed by GBSA and RSF.
arXiv Detail & Related papers (2025-10-28T14:42:28Z)
- Interpretable Machine Learning for Life Expectancy Prediction: A Comparative Study of Linear Regression, Decision Tree, and Random Forest [0.0]
This study evaluates three machine learning models -- Linear Regression (LR), Regression Decision Tree (RDT), and Random Forest (RF). RF achieves the highest predictive accuracy ($R^2 = 0.9423$), significantly outperforming LR and RDT. These insights underscore the synergy between ensemble methods and transparency in addressing public-health challenges.
arXiv Detail & Related papers (2025-10-01T06:02:31Z)
- Early Mortality Prediction in ICU Patients with Hypertensive Kidney Disease Using Interpretable Machine Learning [3.4335475695580127]
Hypertensive kidney disease (HKD) patients in intensive care units (ICUs) face high short-term mortality. We developed a machine learning framework to predict 30-day in-hospital mortality among ICU patients with HKD.
arXiv Detail & Related papers (2025-07-25T00:48:23Z)
- Analyzing Geospatial and Socioeconomic Disparities in Breast Cancer Screening Among Populations in the United States: Machine Learning Approach [0.3958317527488535]
This study aims to assess breast cancer screening rates nationwide in the United States. Data on mammography screening at the census tract level for 2018 and 2020 were collected. We developed a large dataset of social determinants of health, comprising 13 variables for 72337 census tracts.
arXiv Detail & Related papers (2025-01-30T21:07:34Z)
- Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework [77.45983464131977]
We focus on how likely it is that a RAG model's prediction is incorrect, resulting in uncontrollable risks in real-world applications. Our research identifies two critical latent factors affecting RAG's confidence in its predictions. We develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers.
arXiv Detail & Related papers (2024-09-24T14:52:14Z)
- Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
- Penalized Deep Partially Linear Cox Models with Application to CT Scans of Lung Cancer Patients [42.09584755334577]
Lung cancer is a leading cause of cancer mortality globally, highlighting the importance of understanding its mortality risks to design effective therapies.
The National Lung Screening Trial (NLST) employed computed tomography texture analysis to quantify the mortality risks of lung cancer patients.
We propose a novel Penalized Deep Partially Linear Cox Model (Penalized DPLC), which incorporates the SCAD penalty to select important texture features and employs a deep neural network to estimate the nonparametric component of the model.
arXiv Detail & Related papers (2023-03-09T15:38:16Z)
- Clinical Deterioration Prediction in Brazilian Hospitals Based on Artificial Neural Networks and Tree Decision Models [56.93322937189087]
An extremely boosted neural network (XBNet) is used to predict clinical deterioration (CD).
The XGBoost model obtained the best results in predicting CD among Brazilian hospitals' data.
arXiv Detail & Related papers (2022-12-17T23:29:14Z)
- Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score [59.75318183140857]
Estimating causal effects from randomized experiments is central to clinical research.
Most methods for historical borrowing achieve reductions in variance by sacrificing strict type-I error rate control.
arXiv Detail & Related papers (2020-12-17T21:10:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.