Related papers: MEDS-Tab: Automated tabularization and baseline methods for MEDS datasets

MEDS-Tab: Automated tabularization and baseline methods for MEDS datasets

URL: http://arxiv.org/abs/2411.00200v1
Date: Thu, 31 Oct 2024 20:36:37 GMT
Title: MEDS-Tab: Automated tabularization and baseline methods for MEDS datasets
Authors: Nassim Oufattole, Teya Bergamaschi, Aleksia Kolo, Hyewon Jeong, Hanna Gaggin, Collin M. Stultz, Matthew B. A. McDermott,
Abstract summary: This work is powered by complementary advances in core data standardization through the MEDS framework. We dramatically simplify and accelerate this process of scalably featurizing irregularly sampled time-series data. This system will greatly enhance the reliability, scalable, and ease of development of powerful ML solutions for health problems across diverse datasets and clinical settings.
Score: 2.8209943093430443
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Effective, reliable, and scalable development of machine learning (ML) solutions for structured electronic health record (EHR) data requires the ability to reliably generate high-quality baseline models for diverse supervised learning tasks in an efficient and performant manner. Historically, producing such baseline models has been a largely manual effort--individual researchers would need to decide on the particular featurization and tabularization processes to apply to their individual raw, longitudinal data; and then train a supervised model over those data to produce a baseline result to compare novel methods against, all for just one task and one dataset. In this work, powered by complementary advances in core data standardization through the MEDS framework, we dramatically simplify and accelerate this process of tabularizing irregularly sampled time-series data, providing researchers the ability to automatically and scalably featurize and tabularize their longitudinal EHR data across tens of thousands of individual features, hundreds of millions of clinical events, and diverse windowing horizons and aggregation strategies, all before ultimately leveraging these tabular data to automatically produce high-caliber XGBoost baselines in a highly computationally efficient manner. This system scales to dramatically larger datasets than tabularization tools currently available to the community and enables researchers with any MEDS format dataset to immediately begin producing reliable and performant baseline prediction results on various tasks, with minimal human effort required. This system will greatly enhance the reliability, reproducibility, and ease of development of powerful ML solutions for health problems across diverse datasets and clinical settings.

Related papers

DeepSeq: High-Throughput Single-Cell RNA Sequencing Data Labeling via Web Search-Augmented Agentic Generative AI Foundation Models [0.0]
Generative AI foundation models offer transformative potential for processing structured biological data.<n>We propose the use of agentic foundation models with real-time web search to automate the labeling of experimental data, achieving up to 82.5% accuracy.
arXiv Detail & Related papers (2025-06-14T23:30:22Z)
Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
CAAT-EHR: Cross-Attentional Autoregressive Transformer for Multimodal Electronic Health Record Embeddings [0.0]
We introduce CAAT-EHR, a novel architecture designed to generate task-agnostic longitudinal embeddings from raw EHR data. An autoregressive decoder complements the encoder by predicting future time points data during pre-training, ensuring that the resulting embeddings maintain temporal consistency and alignment.
arXiv Detail & Related papers (2025-01-31T05:00:02Z)
Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data. We find token-counts outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective. We propose two complementary approaches: UtiliMax, which extends token-based $200s by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by $simx.
arXiv Detail & Related papers (2025-01-20T21:10:22Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
An Efficient Contrastive Unimodal Pretraining Method for EHR Time Series Data [35.943089444017666]
We propose an efficient method of contrastive pretraining tailored for long clinical timeseries data. Our model demonstrates the ability to impute missing measurements, providing clinicians with deeper insights into patient conditions.
arXiv Detail & Related papers (2024-10-11T19:05:25Z)
Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification [2.5091334993691206]
Development of a robust deep-learning model for retinal disease diagnosis requires a substantial dataset for training. The capacity to generalize effectively on smaller datasets remains a persistent challenge. We've combined a wide range of data sources to improve performance and generalization to new data.
arXiv Detail & Related papers (2024-09-17T17:22:35Z)
Towards Precision Healthcare: Robust Fusion of Time Series and Image Data [8.579651833717763]
We introduce a new method that uses two separate encoders, one for each type of data, allowing the model to understand complex patterns in both visual and time-based information. We also deal with imbalanced datasets and use an uncertainty loss function, yielding improved results. Our experiments show that our method is effective in improving multimodal deep learning for clinical applications.
arXiv Detail & Related papers (2024-05-24T11:18:13Z)
Automated Multi-Task Learning for Joint Disease Prediction on Electronic Health Records [4.159498069487535]
We propose an automated approach named AutoDP, which can search for the optimal configuration of task grouping and architectures simultaneously. It achieves significant performance improvements over both hand-crafted and automated state-of-the-art methods, also maintains a feasible search cost at the same time.
arXiv Detail & Related papers (2024-03-06T22:32:48Z)
Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets [17.774341783844026]
This work proposes Multimodal Integration of Oncology Data System (MINDS) MINDS is a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources. By harmonizing multimodal data, MINDS aims to potentially empower researchers with greater analytical ability.
arXiv Detail & Related papers (2023-09-30T15:44:39Z)
SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation. Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data. Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation [97.42894942391575]
We propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks. Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.
arXiv Detail & Related papers (2020-06-25T09:57:47Z)
Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios. Our results show that using 85% lesser labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.