HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models
- URL: http://arxiv.org/abs/2405.07460v3
- Date: Thu, 13 Jun 2024 16:22:04 GMT
- Title: HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models
- Authors: Aakash Tripathi, Asim Waqas, Yasin Yilmaz, Ghulam Rasool
- Abstract summary: HoneyBee is a scalable modular framework for building multimodal oncology datasets.
It generates embeddings that capture the essential features and relationships within the raw medical data.
HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.
- Score: 17.774341783844026
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Developing accurate machine learning models for oncology requires large-scale, high-quality multimodal datasets. However, creating such datasets remains challenging due to the complexity and heterogeneity of medical data. To address this challenge, we introduce HoneyBee, a scalable modular framework for building multimodal oncology datasets that leverages foundation models to generate representative embeddings. HoneyBee integrates various data modalities, including clinical diagnostic and pathology imaging data, medical notes, reports, records, and molecular data. It employs data preprocessing techniques and foundation models to generate embeddings that capture the essential features and relationships within the raw medical data. The generated embeddings are stored in a structured format using Hugging Face datasets and PyTorch dataloaders for accessibility. Vector databases enable efficient querying and retrieval for machine learning applications. We demonstrate the effectiveness of HoneyBee through experiments assessing the quality and representativeness of these embeddings. The framework is designed to be extensible to other medical domains and aims to accelerate oncology research by providing high-quality, machine learning-ready datasets. HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.
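As a concrete illustration of the access pattern the abstract describes, the sketch below loads a hypothetical embedding dataset with Hugging Face datasets, batches it through a PyTorch DataLoader, and approximates vector-database retrieval with a FAISS index; the dataset identifier and column names are assumptions, not HoneyBee's published schema.
```python
# Minimal sketch of the access pattern described above; identifiers are hypothetical.
import numpy as np
import faiss
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("some-org/oncology-embeddings", split="train")

# PyTorch access: expose the columns as tensors and batch them.
ds.set_format(type="torch", columns=["embedding", "label"])
loader = DataLoader(ds, batch_size=64, shuffle=True)
batch = next(iter(loader))                      # batch["embedding"]: (64, dim)

# Similarity search over the same embeddings with an exact L2 index.
matrix = np.asarray(ds.with_format("numpy")["embedding"], dtype="float32")
index = faiss.IndexFlatL2(matrix.shape[1])
index.add(matrix)
distances, ids = index.search(matrix[:1], 5)    # 5 records closest to record 0
```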
Related papers
- Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification [2.5091334993691206]
Development of a robust deep-learning model for retinal disease diagnosis requires a substantial dataset for training.
The capacity to generalize effectively on smaller datasets remains a persistent challenge.
We combine a wide range of data sources to improve performance and generalization to new data.
arXiv Detail & Related papers (2024-09-17T17:22:35Z)
- MedPix 2.0: A Comprehensive Multimodal Biomedical Dataset for Advanced AI Applications [0.0]
This paper illustrates the entire workflow for building the MedPix 2.0 dataset.
Along with the dataset, we developed a GUI for efficiently navigating the MongoDB instance.
We also propose a CLIP-based model trained on MedPix 2.0 for scan classification tasks.
arXiv Detail & Related papers (2024-07-03T10:49:21Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- OpenMEDLab: An Open-source Platform for Multi-modality Foundation Models in Medicine [55.29668193415034]
We present OpenMEDLab, an open-source platform for multi-modality foundation models.
It encapsulates pioneering solutions for prompting and fine-tuning large language and vision models for frontline clinical and bioinformatic applications.
It opens access to a group of pre-trained foundation models for various medical image modalities, clinical text, protein engineering, etc.
arXiv Detail & Related papers (2024-02-28T03:51:02Z)
- HEALNet: Multimodal Fusion for Heterogeneous Biomedical Data [10.774128925670183]
This paper presents the Hybrid Early-fusion Attention Learning Network (HEALNet), a flexible multimodal fusion architecture.
We conduct multimodal survival analysis on Whole Slide Images and multi-omic data across four cancer datasets from The Cancer Genome Atlas (TCGA).
HEALNet achieves state-of-the-art performance compared with other end-to-end trained fusion models (a generic early-fusion sketch follows this entry).
arXiv Detail & Related papers (2023-11-15T17:06:26Z)
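The entry above names an early-fusion attention architecture; the following is a generic sketch of that pattern under assumed dimensions, not the authors' HEALNet implementation: modality features are projected into a shared space, fused by self-attention, and pooled for prediction.
```python
# Generic early-fusion sketch, not HEALNet's actual code; dimensions illustrative.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    def __init__(self, wsi_dim=768, omics_dim=200, hidden=256, n_out=4):
        super().__init__()
        self.wsi_proj = nn.Linear(wsi_dim, hidden)      # whole-slide-image features
        self.omics_proj = nn.Linear(omics_dim, hidden)  # multi-omic features
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, n_out)            # e.g. discrete hazard bins

    def forward(self, wsi, omics):
        tokens = torch.stack([self.wsi_proj(wsi), self.omics_proj(omics)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)    # modalities attend to each other
        return self.head(fused.mean(dim=1))             # pool tokens, then predict

logits = EarlyFusion()(torch.randn(8, 768), torch.randn(8, 200))  # batch of 8 patients
```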
- Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets [17.774341783844026]
This work proposes the Multimodal Integration of Oncology Data System (MINDS).
MINDS is a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources.
By harmonizing multimodal data, MINDS aims to give researchers greater analytical power.
arXiv Detail & Related papers (2023-09-30T15:44:39Z)
- medigan: A Python Library of Pretrained Generative Models for Enriched Data Access in Medical Imaging [3.8568465270960264]
medigan is a one-stop shop for pretrained generative models implemented as an open-source framework-agnostic Python library.
It allows researchers and developers to create, augment, and domain-adapt their training data in just a few lines of code (see the sketch after this entry).
The library's scalability and design are demonstrated by its growing number of integrated, readily usable pretrained generative models.
arXiv Detail & Related papers (2022-09-28T23:45:33Z)
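The medigan entry above promises data augmentation in a few lines; below is a minimal sketch following the library's documented Generators interface, with the model ID and output path as illustrative assumptions to be checked against medigan's model registry.
```python
# Sketch of the few-lines workflow described above; identifiers are illustrative.
from medigan import Generators

generators = Generators()
generators.generate(
    model_id="00001_DCGAN_MMG_CALC_ROI",  # a pretrained mammography ROI generator
    num_samples=50,                       # synthetic images to add to a training set
    output_path="synthetic_data/",
)
```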
- Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can simulate million-agent populations in a few seconds on commodity hardware, integrate with deep neural networks, and ingest heterogeneous data sources (a toy differentiable-simulation sketch follows this entry).
arXiv Detail & Related papers (2022-07-20T07:32:02Z)
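To make the differentiable-simulation idea concrete, here is a toy sketch (not GradABM's actual API): a tiny SIR-style epidemic step written in PyTorch, so an infection-rate parameter can be calibrated to synthetic case counts by gradient descent.
```python
# Toy differentiable-simulation example; all numbers are synthetic.
import torch

beta = torch.tensor(0.1, requires_grad=True)    # infection rate to be learned
gamma = 0.05                                    # fixed recovery rate
optimizer = torch.optim.Adam([beta], lr=0.01)
observed = torch.tensor([12.0, 18.0, 26.0])     # synthetic daily new infections

for _ in range(200):
    s, i = torch.tensor(990.0), torch.tensor(10.0)
    loss = 0.0
    for t in range(3):
        new_inf = beta * s * i / 1000.0         # differentiable transition
        s, i = s - new_inf, i + new_inf - gamma * i
        loss = loss + (new_inf - observed[t]) ** 2
    optimizer.zero_grad()
    loss.backward()                             # gradients flow through the simulator
    optimizer.step()
```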
- Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study [62.376800537374024]
We study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction.
We integrate the LM-based models with KG embedding models using a router that learns to assign each input example to one of the two model types, yielding a substantial performance boost (a schematic router is sketched after this entry).
arXiv Detail & Related papers (2021-06-17T17:55:33Z)
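A schematic of the routing idea, not the paper's implementation: a small learned gate weighs the language-model score against the KG-embedding score for each example (a soft mixture is used here for simplicity, whereas the paper assigns each example to one model).
```python
# Learned gate over two link-prediction scorers; dimensions are illustrative.
import torch
import torch.nn as nn

class Router(nn.Module):
    def __init__(self, feat_dim=16):
        super().__init__()
        self.gate = nn.Linear(feat_dim, 1)  # example features -> routing weight

    def forward(self, features, kg_score, lm_score):
        w = torch.sigmoid(self.gate(features)).squeeze(-1)  # prob. of trusting the LM
        return w * lm_score + (1.0 - w) * kg_score

scores = Router()(torch.randn(32, 16), torch.rand(32), torch.rand(32))
```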
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem: identifying subgroups of similar patients.
We introduce meta-learning techniques to develop a new model that can extract common experience or knowledge from interrelated clinical tasks.
Our model is built upon a carefully designed meta-learner, the Prototypical Network, a simple yet effective meta-learning method for few-shot image classification (its core computation is sketched after this entry).
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
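For readers unfamiliar with the meta-learner named above, the sketch below shows the core Prototypical Network computation: class prototypes are the means of support embeddings, and queries are scored by negative distance to each prototype; shapes are illustrative.
```python
# Core Prototypical Network computation; shapes and data are illustrative.
import torch

def proto_logits(support, support_labels, queries, n_classes):
    # support: (n_support, dim), queries: (n_query, dim)
    protos = torch.stack([support[support_labels == c].mean(dim=0)
                          for c in range(n_classes)])   # (n_classes, dim)
    return -torch.cdist(queries, protos)                # higher = closer

support, queries = torch.randn(10, 64), torch.randn(4, 64)
labels = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
pred = proto_logits(support, labels, queries, n_classes=2).argmax(dim=1)
```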
- Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that, using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting (a schematic self-training loop follows this entry).
arXiv Detail & Related papers (2020-05-03T02:36:00Z)
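A schematic of the self-training loop such label-efficient methods build on (the confidence threshold and structure are illustrative assumptions, not details from the paper): a model trained on scarce labels pseudo-labels unlabeled X-rays, and confident predictions are folded back into the training set.
```python
# One schematic self-training round; threshold and loop structure are assumptions.
import torch

def self_train_round(model, unlabeled_loader, threshold=0.9):
    pseudo = []
    model.eval()
    with torch.no_grad():
        for images in unlabeled_loader:
            probs = torch.softmax(model(images), dim=1)
            conf, labels = probs.max(dim=1)
            keep = conf > threshold              # keep only confident predictions
            pseudo.extend(zip(images[keep], labels[keep]))
    return pseudo  # merge with the labeled set before the next fine-tuning round
```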
This list is automatically generated from the titles and abstracts of the papers on this site.