A Multimodal Data Processing Pipeline for MIMIC-IV Dataset
- URL: http://arxiv.org/abs/2601.11606v1
- Date: Thu, 08 Jan 2026 20:05:05 GMT
- Title: A Multimodal Data Processing Pipeline for MIMIC-IV Dataset
- Authors: Farzana Islam Adiba, Varsha Danduri, Fahmida Liza Piya, Ali Abbasi, Mehak Gupta, Rahmatollah Beheshti
- Abstract summary: MIMIC-IV is a large, publicly available electronic health record (EHR) resource widely used for clinical machine learning research. It comprises multiple modalities, including structured data, clinical notes, waveforms, and imaging data. While several pipelines for MIMIC-IV data extraction are available, they target a small subset of modalities or do not fully support arbitrary downstream applications. In this work, we greatly expand our prior popular unimodal pipeline and present a comprehensive and customizable multimodal pipeline.
- Score: 6.536530002576318
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The MIMIC-IV dataset is a large, publicly available electronic health record (EHR) resource widely used for clinical machine learning research. It comprises multiple modalities, including structured data, clinical notes, waveforms, and imaging data. Working with these disjointed modalities requires an extensive manual effort to preprocess and align them for downstream analysis. While several pipelines for MIMIC-IV data extraction are available, they target a small subset of modalities or do not fully support arbitrary downstream applications. In this work, we greatly expand our prior popular unimodal pipeline and present a comprehensive and customizable multimodal pipeline that can significantly reduce multimodal processing time and enhance the reproducibility of MIMIC-based studies. Our pipeline systematically integrates the listed modalities, enabling automated cohort selection, temporal alignment across modalities, and standardized multimodal output formats suitable for arbitrary static and time-series downstream applications. We release the code, a simple UI, and a Python package for selective integration (with embedding) at https://github.com/healthylaife/MIMIC-IV-Data-Pipeline.
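The temporal alignment mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `align_to_windows`, the fixed hourly windows anchored at admission time, and the toy events are all assumptions made for illustration. The idea shown is only that timestamped events from different modalities are placed on one shared time grid.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def align_to_windows(admit_time, events_by_modality, window_hours=1, n_windows=4):
    """Bucket timestamped events from several modalities into shared windows.

    Hypothetical helper: windows are fixed-width and start at admit_time.
    Events before admission or after the last window are dropped.
    """
    width = timedelta(hours=window_hours)
    # One mapping per window: modality name -> values observed in that window.
    windows = [defaultdict(list) for _ in range(n_windows)]
    for modality, events in events_by_modality.items():
        for timestamp, value in events:
            # timedelta // timedelta yields an integer window index (floored).
            index = (timestamp - admit_time) // width
            if 0 <= index < n_windows:
                windows[index][modality].append(value)
    return [dict(w) for w in windows]


# Toy data (invented for this example, not taken from MIMIC-IV).
admit = datetime(2026, 1, 1, 8, 0)
events = {
    "vitals": [(admit + timedelta(minutes=10), 88),
               (admit + timedelta(minutes=70), 92)],
    "notes": [(admit + timedelta(minutes=95), "nursing note")],
    "imaging": [(admit - timedelta(minutes=30), "image_before_admission")],
}
aligned = align_to_windows(admit, events, window_hours=1, n_windows=3)
```

Here the second window holds both a vitals value and a note, because both fall in the second hour after admission, while the imaging event recorded before admission is excluded. A real pipeline would also need to handle aggregation within a window and missing modalities.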
Related papers
- SurvBench: A Standardised Preprocessing Pipeline for Multi-Modal Electronic Health Record Survival Analysis [2.74994442100348]
We present SurvBench, a comprehensive, open-source preprocessing pipeline that transforms raw PhysioNet datasets into model-ready tensors for multi-modal survival analysis. SurvBench provides data loaders for three major critical care databases: MIMIC-IV, eICU, and MC-MED. The pipeline implements rigorous data quality controls, patient-level splitting to prevent data leakage, explicit missingness tracking, and standardised temporal aggregation.
arXiv Detail & Related papers (2025-11-14T23:19:14Z)
- Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning [49.04912820721943]
Supervised fine-tuning (SFT) is computationally expensive and sometimes suffers from overfitting or bias amplification. This work studies the online batch selection family that dynamically scores and filters samples during the training process. We develop UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT.
arXiv Detail & Related papers (2025-10-19T15:32:01Z)
- MINIMA: Modality Invariant Image Matching [52.505282811925454]
We present MINIMA, a unified image matching framework for multiple cross-modal cases. We scale up the modalities from cheap but rich RGB-only matching data, by means of generative models. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability.
arXiv Detail & Related papers (2024-12-27T02:39:50Z)
- On Domain-Adaptive Post-Training for Multimodal Large Language Models [78.65220510401045]
This paper systematically investigates domain adaptation of MLLMs via post-training. We focus on data synthesis, training pipeline, and task evaluation. We conduct experiments in high-impact domains such as biomedicine, food, and remote sensing.
arXiv Detail & Related papers (2024-11-29T18:42:28Z)
- Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data [35.85909368345219]
We introduce Infinity-MM, a large-scale multimodal instruction dataset. We perform unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. We propose a synthetic instruction generation method based on a tagging system and open-source Vision-Language Models.
arXiv Detail & Related papers (2024-10-24T09:03:48Z)
- Multi-Modal Dataset Creation for Federated Learning with DICOM Structured Reports [26.2463670182172]
Federated training is often hindered by heterogeneous datasets due to divergent data storage options, inconsistent naming schemes, varied annotation procedures, and disparities in label quality.
This is particularly evident in the emerging multi-modal learning paradigms, where dataset harmonization, including a uniform data representation and filtering options, is of paramount importance.
We developed an open platform for data integration and interactive filtering capabilities that simplifies the process of assembling multi-modal datasets.
arXiv Detail & Related papers (2024-07-12T07:34:10Z)
- JUMP: A joint multimodal registration pipeline for neuroimaging with minimal preprocessing [1.3549498237473223]
We present a pipeline for unbiased and robust registration of neuroimaging modalities with minimal pre-processing.
The pipeline currently works with structural MRI, resting state fMRI and amyloid PET images.
We show the predictive power of the derived biomarkers in a case-control study and examine the cross-modal relationship between different image modalities.
arXiv Detail & Related papers (2024-01-25T15:40:19Z)
- Convolutional Monge Mapping Normalization for learning on sleep data [63.22081662149488]
We propose a new method called Convolutional Monge Mapping Normalization (CMMN).
CMMN consists in filtering the signals in order to adapt their power spectrum density (PSD) to a Wasserstein barycenter estimated on training data.
Numerical experiments on sleep EEG data show that CMMN leads to significant and consistent performance gains independent from the neural network architecture.
arXiv Detail & Related papers (2023-05-30T08:24:01Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
- An Extensive Data Processing Pipeline for MIMIC-IV [0.20326203100766121]
We provide an end-to-end fully customizable pipeline to extract, clean, and pre-process data.
The pipeline also supports prediction and evaluation on the fourth version of the MIMIC dataset (MIMIC-IV) for ICU and non-ICU-related clinical time-series prediction tasks.
arXiv Detail & Related papers (2022-04-29T01:09:38Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- A DICOM Framework for Machine Learning Pipelines against Real-Time Radiology Images [50.222197963803644]
Niffler is an integrated framework that enables the execution of machine learning pipelines at research clusters.
Niffler uses the Digital Imaging and Communications in Medicine (DICOM) protocol to fetch and store imaging data.
We present its architecture and three of its use cases: an inferior vena cava filter detection from the images in real-time, identification of scanner utilization, and scanner clock calibration.
arXiv Detail & Related papers (2020-04-16T21:06:49Z)