Toward Scalable Early Cancer Detection: Evaluating EHR-Based Predictive Models Against Traditional Screening Criteria
- URL: http://arxiv.org/abs/2511.11293v1
- Date: Fri, 14 Nov 2025 13:26:49 GMT
- Title: Toward Scalable Early Cancer Detection: Evaluating EHR-Based Predictive Models Against Traditional Screening Criteria
- Authors: Jiheum Park, Chao Pang, Tristan Y. Lee, Jeong Yun Yang, Jacob Berkowitz, Alexander Z. Wei, Nicholas Tatonetti,
- Abstract summary: Current cancer screening guidelines cover only a few cancer types.<n> Predictive models using electronic health records may provide a more effective tool for identifying high-risk groups.
- Score: 36.59932276624224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current cancer screening guidelines cover only a few cancer types and rely on narrowly defined criteria such as age or a single risk factor like smoking history, to identify high-risk individuals. Predictive models using electronic health records (EHRs), which capture large-scale longitudinal patient-level health information, may provide a more effective tool for identifying high-risk groups by detecting subtle prediagnostic signals of cancer. Recent advances in large language and foundation models have further expanded this potential, yet evidence remains limited on how useful HER-based models are compared with traditional risk factors currently used in screening guidelines. We systematically evaluated the clinical utility of EHR-based predictive models against traditional risk factors, including gene mutations and family history of cancer, for identifying high-risk individuals across eight major cancers (breast, lung, colorectal, prostate, ovarian, liver, pancreatic, and stomach), using data from the All of Us Research Program, which integrates EHR, genomic, and survey data from over 865,000 participants. Even with a baseline modeling approach, EHR-based models achieved a 3- to 6-fold higher enrichment of true cancer cases among individuals identified as high risk compared with traditional risk factors alone, whether used as a standalone or complementary tool. The EHR foundation model, a state-of-the-art approach trained on comprehensive patient trajectories, further improved predictive performance across 26 cancer types, demonstrating the clinical potential of EHR-based predictive modeling to support more precise and scalable early detection strategies.
Related papers
- Integrating Genomics into Multimodal EHR Foundation Models [56.31910745104141]
This paper introduces an innovative EHR foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality.<n>The framework aims to learn complex relationships between clinical data and genetic predispositions.<n>This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies.
arXiv Detail & Related papers (2025-10-24T15:56:40Z) - Reasoning Language Model for Personalized Lung Cancer Screening [10.241766336141685]
Lung CT Screening Reporting and Data System (Lung-RADS) faces trade-offs between sensitivity and specificity.<n>We propose a reasoning language model (RLM) to integrate radiology findings with longitudinal medical records for individualized lung cancer risk assessment.
arXiv Detail & Related papers (2025-09-07T18:38:39Z) - A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer [54.58205672910646]
RenalCLIP is a visual-language foundation model for characterization, diagnosis and prognosis of renal mass.<n>It achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer.
arXiv Detail & Related papers (2025-08-22T17:48:19Z) - Cohort-Aware Agents for Individualized Lung Cancer Risk Prediction Using a Retrieval-Augmented Model Selection Framework [4.828586430285072]
Lung cancer risk prediction remains challenging due to substantial variability across patient populations and clinical settings.<n>We propose a personalized lung cancer risk prediction agent that dynamically selects the most appropriate model for each patient.
arXiv Detail & Related papers (2025-08-20T02:59:39Z) - Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models [22.56012797703626]
We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models.<n>Caught-FM identifies high-risk patients for further screening solely based on their historical medical records.<n>Our analysis demonstrates the robustness of CATCH-FM in various patient distributions, the benefits of operating in the ICD code space, and its ability to capture non-trivial cancer risk factors.
arXiv Detail & Related papers (2025-05-30T20:31:09Z) - Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models [70.64969663547703]
AdaCVD is an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank.<n>It addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data.
arXiv Detail & Related papers (2025-05-30T14:42:02Z) - Deep learning-based identification of patients at increased risk of cancer using routine laboratory markers [2.482109221766753]
Cancer screening involves an initial risk stratification step to determine the screening method and frequency.<n>For most screening programs, age and clinical risk factors such as family history are part of the initial risk stratification algorithm.<n>We demonstrate that the combination of simple, widely available blood tests, such as complete blood count and complete metabolic panel, could potentially be used to identify patients at risk for colorectal, liver, and lung cancers.
arXiv Detail & Related papers (2024-10-25T15:50:27Z) - Penalized Deep Partially Linear Cox Models with Application to CT Scans
of Lung Cancer Patients [42.09584755334577]
Lung cancer is a leading cause of cancer mortality globally, highlighting the importance of understanding its mortality risks to design effective therapies.
The National Lung Screening Trial (NLST) employed computed tomography texture analysis to quantify the mortality risks of lung cancer patients.
We propose a novel Penalized Deep Partially Linear Cox Model (Penalized DPLC), which incorporates the SCAD penalty to select important texture features and employs a deep neural network to estimate the nonparametric component of the model.
arXiv Detail & Related papers (2023-03-09T15:38:16Z) - SurvLatent ODE : A Neural ODE based time-to-event model with competing
risks for longitudinal data improves cancer-associated Deep Vein Thrombosis
(DVT) prediction [68.8204255655161]
We propose a generative time-to-event model, SurvLatent ODE, which parameterizes a latent representation under irregularly sampled data.
Our model then utilizes the latent representation to flexibly estimate survival times for multiple competing events without specifying shapes of event-specific hazard function.
SurvLatent ODE outperforms the current clinical standard Khorana Risk scores for stratifying DVT risk groups.
arXiv Detail & Related papers (2022-04-20T17:28:08Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD)
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to $19%$ over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z) - A Large-scale Multimodal Study for Predicting Mortality Risk Using Minimal and Low Parameter Models and Separable Risk Assessment [0.0]
We develop and validate several multimodal models that can predict 1-year mortality based on a massive clinical dataset.<n>Our focus on predicting 1-year mortality can provide a sense of urgency to the patients.
arXiv Detail & Related papers (2019-01-23T20:47:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.