Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models
- URL: http://arxiv.org/abs/2506.00209v1
- Date: Fri, 30 May 2025 20:31:09 GMT
- Title: Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models
- Authors: Liwen Sun, Hao-Ren Yao, Gary Gao, Ophir Frieder, Chenyan Xiong,
- Abstract summary: We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models.<n>Caught-FM identifies high-risk patients for further screening based on historical medical records.<n>Our analysis demonstrates the robustness of CATCH-FM in various patient distributions.
- Score: 17.739985240125733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cancer screening, leading to early detection, saves lives. Unfortunately, existing screening techniques require expensive and intrusive medical procedures, not globally available, resulting in too many lost would-be-saved lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models, a cancer pre-screening methodology that identifies high-risk patients for further screening solely based on their historical medical records. With millions of electronic healthcare records (EHR), we establish the scaling law of EHR foundation models pretrained on medical code sequences, pretrain compute-optimal foundation models of up to 2.4 billion parameters, and finetune them on clinician-curated cancer risk prediction cohorts. In our retrospective evaluation comprising of thirty thousand patients, CATCH-FM achieved strong efficacy (60% sensitivity) with low risk (99% specificity and Negative Predictive Value), outperforming feature-based tree models as well as general and medical large language models by large margins. Despite significant demographic, healthcare system, and EHR coding differences, CATCH-FM achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard, outperforming EHR foundation models pretrained using on-site patient data. Our analysis demonstrates the robustness of CATCH-FM in various patient distributions, the benefits of operating in the ICD code space, and its ability to capture non-trivial cancer risk factors. Our code will be open-sourced.
Related papers
- Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models [70.64969663547703]
AdaCVD is an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank.<n>It addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data.
arXiv Detail & Related papers (2025-05-30T14:42:02Z) - PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology [33.51485504161335]
We present PathBench, the first comprehensive benchmark for pathology foundation models (PFMs)<n>Our framework incorporates large-scale data, enabling objective comparison of PFMs.<n>We have collected 15,888 WSIs from 8,549 patients across 10 hospitals, encompassing over 64 diagnosis and prognosis tasks.
arXiv Detail & Related papers (2025-05-26T16:42:22Z) - ColonScopeX: Leveraging Explainable Expert Systems with Multimodal Data for Improved Early Diagnosis of Colorectal Cancer [3.541280502270993]
Colorectal cancer (CRC) ranks as the second leading cause of cancer-related deaths and the third most prevalent malignant tumour worldwide.<n>Early detection of CRC remains problematic due to its non-specific and often embarrassing symptoms.<n>We propose ColonScopeX, a machine learning framework utilizing explainable AI (XAI) methodologies to enhance the early detection of CRC and pre-cancerous lesions.
arXiv Detail & Related papers (2025-04-09T20:45:11Z) - Predicting Survivability of Cancer Patients with Metastatic Patterns Using Explainable AI [2.6182192515316247]
This study leverages machine learning (ML) to predict the survivability of cancer patients with metastatic patterns.<n>MSK-MET dataset includes genomic and clinical data from 25,775 patients across 27 cancer types.<n>XGBoost emerged as the best performer with an area under the curve (AUC) of 0.82.
arXiv Detail & Related papers (2025-04-07T20:48:15Z) - Machine Learning Solutions Integrated in an IoT Healthcare Platform for Heart Failure Risk Stratification [3.952604803580729]
The management of chronic Heart Failure (HF) presents significant challenges in modern healthcare.<n>We present a predictive model founded on Machine Learning (ML) techniques to identify patients at HF risk.
arXiv Detail & Related papers (2025-04-07T14:07:05Z) - Boosting Medical Image-based Cancer Detection via Text-guided Supervision from Reports [68.39938936308023]
We propose a novel text-guided learning method to achieve highly accurate cancer detection results.
Our approach can leverage clinical knowledge by large-scale pre-trained VLM to enhance generalization ability.
arXiv Detail & Related papers (2024-05-23T07:03:38Z) - Cancer-Net PCa-Gen: Synthesis of Realistic Prostate Diffusion Weighted
Imaging Data via Anatomic-Conditional Controlled Latent Diffusion [68.45407109385306]
In Canada, prostate cancer is the most common form of cancer in men and accounted for 20% of new cancer cases for this demographic in 2022.
There has been significant interest in the development of deep neural networks for prostate cancer diagnosis, prognosis, and treatment planning using diffusion weighted imaging (DWI) data.
In this study, we explore the efficacy of latent diffusion for generating realistic prostate DWI data through the introduction of an anatomic-conditional controlled latent diffusion strategy.
arXiv Detail & Related papers (2023-11-30T15:11:03Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - A Multi-Institutional Open-Source Benchmark Dataset for Breast Cancer
Clinical Decision Support using Synthetic Correlated Diffusion Imaging Data [82.74877848011798]
Cancer-Net BCa is a multi-institutional open-source benchmark dataset of volumetric CDI$s$ imaging data of breast cancer patients.
Cancer-Net BCa is publicly available as a part of a global open-source initiative dedicated to accelerating advancement in machine learning to aid clinicians in the fight against cancer.
arXiv Detail & Related papers (2023-04-12T05:41:44Z) - Penalized Deep Partially Linear Cox Models with Application to CT Scans
of Lung Cancer Patients [42.09584755334577]
Lung cancer is a leading cause of cancer mortality globally, highlighting the importance of understanding its mortality risks to design effective therapies.
The National Lung Screening Trial (NLST) employed computed tomography texture analysis to quantify the mortality risks of lung cancer patients.
We propose a novel Penalized Deep Partially Linear Cox Model (Penalized DPLC), which incorporates the SCAD penalty to select important texture features and employs a deep neural network to estimate the nonparametric component of the model.
arXiv Detail & Related papers (2023-03-09T15:38:16Z) - Deep-CR MTLR: a Multi-Modal Approach for Cancer Survival Prediction with
Competing Risks [0.4189643331553922]
We present Deep-CR MTLR -- a novel machine learning approach for accurate cancer survival prediction.
We demonstrate improved prognostic performance of the multi-modal approach over single modality predictors in a cohort of 2552 head and neck cancer patients.
arXiv Detail & Related papers (2020-12-10T15:51:47Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD)
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to $19%$ over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.