MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement
- URL: http://arxiv.org/abs/2305.12081v4
- Date: Tue, 30 Apr 2024 22:23:48 GMT
- Title: MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement
- Authors: Zifeng Wang, Chufan Gao, Cao Xiao, Jimeng Sun
- Abstract summary: Tabular data prediction has been employed in medical applications such as patient health risk prediction.
Previous predictors are often trained on small, manually curated datasets.
- Score: 44.693325083735424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tabular data prediction has been employed in medical applications such as patient health risk prediction. However, existing methods usually revolve around algorithm design while overlooking the significance of data engineering. Medical tabular datasets frequently exhibit significant heterogeneity across different sources, with limited sample sizes per source. As such, previous predictors are often trained on manually curated small datasets that struggle to generalize across different tabular datasets during inference. This paper proposes to scale medical tabular data predictors (MediTab) to various tabular inputs with varying features. The method uses a data engine that leverages large language models (LLMs) to consolidate tabular samples, overcoming the barrier across tables with distinct schemas. It also aligns out-of-domain data with the target task using a "learn, annotate, and refinement" pipeline. The expanded training data then enables the pre-trained MediTab to infer on arbitrary tabular inputs in the domain without fine-tuning, resulting in significant improvements over supervised baselines: it reaches an average ranking of 1.57 and 1.00 on 7 patient outcome prediction datasets and 3 trial outcome prediction datasets, respectively. In addition, MediTab exhibits impressive zero-shot performance: it outperforms supervised XGBoost models by 8.9% and 17.2% on average in two prediction tasks, respectively.
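To make the consolidation idea concrete, here is a minimal sketch of serializing rows from tables with different schemas into a shared text space before handing them to an LLM; the prompt template and the `llm_complete` helper are illustrative assumptions, not MediTab's actual data engine.

```python
# Minimal sketch of LLM-based tabular consolidation: rows from tables with
# different schemas are serialized into natural-language descriptions so a
# single text-based predictor can consume them. `llm_complete` is a
# hypothetical stand-in for any chat-completion API.
from typing import Callable

def serialize_row(row: dict) -> str:
    """Flatten one tabular sample into a 'column is value' sentence."""
    return "; ".join(f"{col} is {val}" for col, val in row.items() if val is not None)

def consolidate(row: dict, task: str, llm_complete: Callable[[str], str]) -> str:
    """Ask an LLM to rewrite a serialized row as a clean clinical description."""
    prompt = (
        f"Task: {task}\n"
        f"Patient record: {serialize_row(row)}\n"
        "Rewrite this record as one fluent sentence, keeping every fact."
    )
    return llm_complete(prompt)

# Two sources with incompatible schemas map into the same text space.
trial_row = {"age": 67, "ecog_score": 1, "prior_therapies": 2}
ehr_row = {"Age (years)": 67, "Smoker": "yes", "HbA1c": 7.9}
echo = lambda p: p  # placeholder LLM for a dry run
print(consolidate(trial_row, "trial outcome prediction", echo))
print(consolidate(ehr_row, "patient risk prediction", echo))
```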
Related papers
- TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
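As a rough illustration of a joint diffusion process over mixed-type rows, the toy forward step below applies Gaussian corruption to numerical columns and random resampling to categorical codes; the shared noise schedule and corruption kernel are simplified assumptions, not TabDiff's exact formulation.

```python
# Toy forward step of a joint diffusion over mixed-type rows: Gaussian noise
# for numerical columns, random resampling toward uniform for categorical
# codes. The shared schedule exp(-5t) is a simplified assumption.
import numpy as np

rng = np.random.default_rng(0)

def diffuse_numeric(x: np.ndarray, t: float) -> np.ndarray:
    """Variance-preserving Gaussian corruption at time t in [0, 1]."""
    alpha = np.exp(-5.0 * t)
    return np.sqrt(alpha) * x + np.sqrt(1 - alpha) * rng.standard_normal(x.shape)

def diffuse_categorical(c: np.ndarray, n_classes: int, t: float) -> np.ndarray:
    """With probability 1 - exp(-5t), replace a code with a uniform draw."""
    corrupt = rng.random(c.shape) < 1 - np.exp(-5.0 * t)
    return np.where(corrupt, rng.integers(0, n_classes, size=c.shape), c)

x_num = np.array([0.2, -1.3, 0.7])   # standardized numerical features
x_cat = np.array([2, 0, 1])          # categorical codes, 3 classes
for t in (0.1, 0.5, 0.9):
    print(t, diffuse_numeric(x_num, t), diffuse_categorical(x_cat, 3, t))
```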
- AdapTable: Test-Time Adaptation for Tabular Data via Shift-Aware Uncertainty Calibrator and Label Distribution Handler [29.395855812763617]
We propose AdapTable, a framework for adapting machine learning models to target data without accessing source data.
AdapTable operates in two stages: 1) calibrating model predictions using a shift-aware uncertainty calibrator, and 2) adjusting these predictions to match the target label distribution with a label distribution handler.
Our results demonstrate AdapTable's ability to handle various real-world distribution shifts, achieving up to a 16% improvement.
arXiv Detail & Related papers (2024-07-15T15:02:53Z)
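The label-distribution-handling stage is in the spirit of classic prior correction; the sketch below rescales a source model's probabilities by the ratio of estimated target to source class priors. This is a standard label-shift trick used to illustrate the idea, not necessarily AdapTable's exact handler.

```python
# Prior-correction sketch for test-time label shift: rescale source-model
# probabilities by the ratio of target to source class priors, then
# renormalize each row so it sums to one.
import numpy as np

def adjust_to_target_prior(probs, source_prior, target_prior):
    """probs: (n, k) softmax outputs from the source model."""
    w = np.asarray(target_prior) / np.asarray(source_prior)
    adjusted = probs * w                      # reweight each class column
    return adjusted / adjusted.sum(axis=1, keepdims=True)

probs = np.array([[0.7, 0.3], [0.4, 0.6]])
source_prior = [0.5, 0.5]           # class balance seen during training
target_prior = [0.2, 0.8]           # estimated from unlabeled target data
print(adjust_to_target_prior(probs, source_prior, target_prior))
```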
- A Closer Look at Deep Learning on Tabular Data [52.50778536274327]
Tabular data is prevalent across various domains in machine learning.
Deep Neural Network (DNN)-based methods have shown promising performance comparable to tree-based ones.
arXiv Detail & Related papers (2024-07-01T04:24:07Z)
- Large Scale Transfer Learning for Tabular Data via Language Modeling [30.44823668480631]
We present TabuLa-8B, a language model for tabular prediction.
We show that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing.
We release our model, code, and data along with the publication of this paper.
arXiv Detail & Related papers (2024-06-17T18:58:20Z)
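Zero-shot tabular prediction with a language model amounts to rendering a row and its candidate labels as text; the template below is an illustrative guess at that general recipe, not TabuLa-8B's actual serialization format.

```python
# Illustrative zero-shot prompt for LM-based tabular prediction: the row and
# the candidate labels are rendered as text, and the model's completion is
# read off as the predicted label.
def zero_shot_prompt(row: dict, target: str, choices: list[str]) -> str:
    features = "\n".join(f"- {k}: {v}" for k, v in row.items())
    options = " or ".join(choices)
    return (
        f"Given the record below, predict '{target}' ({options}).\n"
        f"{features}\n"
        f"{target}:"
    )

row = {"age": 54, "chest_pain_type": "atypical", "cholesterol": 260}
print(zero_shot_prompt(row, "heart_disease", ["yes", "no"]))
# The completion's first token ("yes"/"no") is scored as the prediction.
```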
- Cross-Table Pretraining towards a Universal Function Space for Heterogeneous Tabular Data [35.61663559675556]
Cross-dataset pretraining has shown notable success in various fields.
In this study, we introduce a cross-table pretrained Transformer, XTFormer, for versatile downstream tabular prediction tasks.
We pretrain XTFormer to establish a "meta-function" space that encompasses all potential feature-target mappings.
arXiv Detail & Related papers (2024-06-01T03:24:31Z)
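A common way to let one Transformer ingest tables with arbitrary columns is to turn each (column name, value) pair into a token; the PyTorch sketch below shows that interface as a generic illustration, not XTFormer's architecture (the hashing trick for column names is an assumption).

```python
# Generic cross-table interface: each (column name, value) pair becomes one
# token embedding, so a shared Transformer can ingest any schema.
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    def __init__(self, dim: int = 64, vocab: int = 5000):
        super().__init__()
        self.name_emb = nn.Embedding(vocab, dim)   # hashed column-name ids
        self.value_proj = nn.Linear(1, dim)        # scalar value embedding

    def forward(self, names: list[str], values: torch.Tensor) -> torch.Tensor:
        ids = torch.tensor([hash(n) % 5000 for n in names])
        return self.name_emb(ids) + self.value_proj(values.unsqueeze(-1))

tok = FeatureTokenizer()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(64, 4, batch_first=True), 2)
# Two rows from tables with different schemas share one encoder.
for names, vals in [(["age", "bmi"], [61.0, 27.4]),
                    (["dose_mg", "weeks", "ecog"], [40.0, 12.0, 1.0])]:
    tokens = tok(names, torch.tensor(vals)).unsqueeze(0)   # (1, n_feat, 64)
    print(encoder(tokens).mean(dim=1).shape)               # pooled row embedding
```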
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
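A simplified version of a distance-based meta-representation is sketched below: a query is described by its sorted distances to the k nearest training points of each class, giving a fixed-size vector regardless of the raw feature count. The exact construction is an assumption in the spirit of TabPTM, not its published recipe.

```python
# Simplified meta-representation: describe a query by its sorted distances
# to the k nearest training points of each class. The result has a fixed
# size (n_classes * k) whatever the raw feature count, so one network can
# score instances from heterogeneous datasets.
import numpy as np

def meta_representation(x, X_train, y_train, k: int = 3) -> np.ndarray:
    reps = []
    for c in np.unique(y_train):
        d = np.linalg.norm(X_train[y_train == c] - x, axis=1)
        reps.append(np.sort(d)[:k])           # k closest same-class distances
    return np.concatenate(reps)

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 8))          # 8 raw features
y_train = rng.integers(0, 2, size=100)
print(meta_representation(X_train[0], X_train, y_train).shape)  # (2 * 3,)
```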
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt to leverage table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low-resource regimes, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
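The downstream usage pattern is plain data augmentation: synthetic rows from the generator are appended to the real training set before fitting a backbone such as LightGBM. In the sketch below, `jitter_sampler` is a trivial stand-in for TapTap's pretrained generator.

```python
# Augmentation pattern used with generative table pretraining: append
# generator output to the real training set, then fit any backbone (here
# LightGBM). `jitter_sampler` is a stand-in for TapTap's generator.
import numpy as np
import lightgbm as lgb

def jitter_sampler(X, y, n: int, rng):
    """Stand-in generator: resample real rows and add small Gaussian noise."""
    idx = rng.integers(0, len(X), size=n)
    return X[idx] + 0.05 * rng.standard_normal((n, X.shape[1])), y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_syn, y_syn = jitter_sampler(X, y, n=400, rng=rng)

model = lgb.LGBMClassifier(n_estimators=100)
model.fit(np.vstack([X, X_syn]), np.concatenate([y, y_syn]))
print(model.predict(X[:5]))
```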
- P-Transformer: A Prompt-based Multimodal Transformer Architecture For Medical Tabular Data [2.6487114372147182]
We propose P-Transformer, a Prompt-based multimodal Transformer architecture designed specifically for medical tabular data.
The framework efficiently encodes diverse modalities from both structured and unstructured data into a harmonized language semantic space.
P-Transformer demonstrated improvements of 10.9%/11.0% on RMSE/MAE, 0.5%/2.2% on RMSE/MAE, and 1.6%/0.8% on BACC/AUROC over state-of-the-art (SOTA) baselines in predictive performance.
arXiv Detail & Related papers (2023-03-30T14:25:44Z)
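Mapping structured fields and free-text notes into one language space can be approximated by serializing the tabular part as a prompt and running both through the same pretrained encoder; the Hugging Face `transformers` sketch below is a generic approximation, not the paper's architecture.

```python
# Generic sketch of mapping structured fields and clinical notes into one
# language embedding space: serialize the tabular part as a prompt and run
# both through the same pretrained encoder.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return enc(**batch).last_hidden_state.mean(dim=1)   # (1, 768)

row_prompt = "age: 71; systolic bp: 152; on anticoagulants: yes"
note = "Patient reports intermittent dizziness over the past two weeks."
fused = torch.cat([embed(row_prompt), embed(note)], dim=-1)  # (1, 1536)
print(fused.shape)  # feed into a small prediction head
```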
- Learning Enhanced Representations for Tabular Data via Neighborhood Propagation [24.485479610138498]
We construct a hypergraph to model the cross-row and cross-column patterns of data instances.
We then perform message propagation to enhance the target data instance representation.
Experiments on two important data prediction tasks validate the superiority of the proposed PET model.
arXiv Detail & Related papers (2022-06-14T04:24:52Z)
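Hypergraph message propagation reduces to two incidence-matrix products (rows to hyperedges and back); the NumPy sketch below uses a deliberately simplified construction, one hyperedge per shared column value, rather than PET's actual hypergraph.

```python
# Minimal hypergraph propagation: nodes are data rows, hyperedges group rows
# sharing a categorical value. One round averages node features into
# hyperedges and scatters them back to the member nodes.
import numpy as np

rows = [{"city": "NY", "plan": "A"}, {"city": "NY", "plan": "B"},
        {"city": "LA", "plan": "A"}]
edges = sorted({(k, v) for r in rows for k, v in r.items()})
H = np.array([[1.0 if r.get(k) == v else 0.0 for (k, v) in edges] for r in rows])

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])      # initial row features
edge_feat = (H / H.sum(axis=0)).T @ X                   # rows -> hyperedges (mean)
X_new = (H / H.sum(axis=1, keepdims=True)) @ edge_feat  # hyperedges -> rows
print(X_new)   # each row now mixes in features of rows sharing a value
```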
- Unsupervised Pre-Training on Patient Population Graphs for Patient-Level Predictions [48.02011627390706]
Pre-training has shown success in different areas of machine learning, such as Computer Vision (CV), Natural Language Processing (NLP) and medical imaging.
In this paper, we apply unsupervised pre-training to heterogeneous, multi-modal EHR data for patient outcome prediction.
We find that our proposed graph-based pre-training method helps model the data at a population level.
arXiv Detail & Related papers (2022-03-23T17:59:45Z)
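Unsupervised pretraining on a population graph is often realized as masked-feature reconstruction: hide part of each patient's features and train the network to recover them with help from neighbors. The PyTorch sketch below substitutes simple mean aggregation for a full GNN and is illustrative only.

```python
# Compact masked-feature pretraining on a patient graph: hide random feature
# entries, mix in neighbor information via mean aggregation, and reconstruct
# the hidden values.
import torch
import torch.nn as nn

n, d = 50, 16
X = torch.randn(n, d)                          # patient feature matrix
A = (torch.rand(n, n) < 0.1).float()           # random patient-similarity graph
A = ((A + A.t()) > 0).float() + torch.eye(n)   # symmetrize, add self-loops
A = A / A.sum(dim=1, keepdim=True)             # row-normalize

decoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, d))
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

for step in range(200):
    mask = (torch.rand(n, d) < 0.3).float()    # hide 30% of entries
    x_in = X * (1 - mask)
    x_rec = decoder(A @ x_in)                  # neighbors help fill the gaps
    loss = ((x_rec - X) ** 2 * mask).mean()    # score only the masked entries
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final reconstruction loss: {loss.item():.4f}")
```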