HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology
- URL: http://arxiv.org/abs/2505.12120v1
- Date: Sat, 17 May 2025 18:59:32 GMT
- Title: HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology
- Authors: Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova,
- Abstract summary: HISTAI is a large, multimodal, open-access Whole Slide Image (WSI) collection comprising over 60,000 slides from various tissue types. Each case in the HISTAI dataset is accompanied by extensive clinical metadata, including diagnosis, demographic information, detailed pathological annotations, and standardized diagnostic coding.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Digital Pathology (DP), particularly through artificial intelligence and Foundation Models, have underscored the importance of large-scale, diverse, and richly annotated datasets. Despite their critical role, publicly available Whole Slide Image (WSI) datasets often lack sufficient scale, tissue diversity, and comprehensive clinical metadata, limiting the robustness and generalizability of AI models. In response, we introduce the HISTAI dataset, a large, multimodal, open-access WSI collection comprising over 60,000 slides from various tissue types. Each case in the HISTAI dataset is accompanied by extensive clinical metadata, including diagnosis, demographic information, detailed pathological annotations, and standardized diagnostic coding. The dataset aims to fill gaps identified in existing resources, promoting innovation, reproducibility, and the development of clinically relevant computational pathology solutions. The dataset can be accessed at https://github.com/HistAI/HISTAI.
Related papers
- Open-Set Semi-Supervised Learning for Long-Tailed Medical Datasets [17.82752126823939]
Real-world generalization requires taking into account the various complexities that can be encountered in the real world.
We propose an open-set learning method for highly imbalanced medical datasets using a semi-supervised approach.
Our analysis shows that addressing the impact of long-tail data in classification significantly improves the overall performance of the network.
arXiv Detail & Related papers (2025-05-20T19:21:38Z)
- A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI [70.06771291117965]
We introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset.
Biomedica contains over 6 million scientific articles and 24 million image-text pairs.
We provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems.
arXiv Detail & Related papers (2025-03-26T05:56:46Z)
- MRGen: Segmentation Data Engine for Underrepresented MRI Modalities [59.61465292965639]
Training medical image segmentation models for rare yet clinically important imaging modalities is challenging due to the scarcity of annotated data.
This paper investigates leveraging generative models to synthesize data for training segmentation models for underrepresented modalities.
We present MRGen, a data engine for controllable medical image synthesis conditioned on text prompts and segmentation masks.
arXiv Detail & Related papers (2024-12-04T16:34:22Z)
- Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification [2.5091334993691206]
Development of a robust deep-learning model for retinal disease diagnosis requires a substantial dataset for training.
The capacity to generalize effectively on smaller datasets remains a persistent challenge.
We've combined a wide range of data sources to improve performance and generalization to new data.
arXiv Detail & Related papers (2024-09-17T17:22:35Z)
- Dataset Distillation for Histopathology Image Classification [46.04496989951066]
We introduce a novel dataset distillation algorithm tailored for histopathology image datasets (Histo-DD).
We conduct a comprehensive evaluation of the effectiveness of the proposed algorithm and the generated histopathology samples in both patch-level and slide-level classification tasks.
arXiv Detail & Related papers (2024-08-19T05:53:38Z)
- Large-scale Long-tailed Disease Diagnosis on Radiology Images [51.453990034460304]
RadDiag is a foundational model supporting 2D and 3D inputs across various modalities and anatomies.
Our dataset, RP3D-DiagDS, contains 40,936 cases with 195,010 scans covering 5,568 disorders.
arXiv Detail & Related papers (2023-12-26T18:20:48Z)
- HEALNet: Multimodal Fusion for Heterogeneous Biomedical Data [10.774128925670183]
This paper presents the Hybrid Early-fusion Attention Learning Network (HEALNet), a flexible multimodal fusion architecture.
We conduct multimodal survival analysis on Whole Slide Images and Multi-omic data on four cancer datasets from The Cancer Genome Atlas (TCGA).
HEALNet achieves state-of-the-art performance compared to other end-to-end trained fusion models.
arXiv Detail & Related papers (2023-11-15T17:06:26Z)
- Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets [17.774341783844026]
This work proposes the Multimodal Integration of Oncology Data System (MINDS).
MINDS is a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources.
By harmonizing multimodal data, MINDS aims to give researchers greater analytical ability.
arXiv Detail & Related papers (2023-09-30T15:44:39Z)
- GastroVision: A Multi-class Endoscopy Image Dataset for Computer Aided Gastrointestinal Disease Detection [6.231109933741383]
This dataset includes different anatomical landmarks, pathological abnormalities, polyp removal cases and normal findings from the GI tract.
It was annotated and verified by experienced GI endoscopists.
We believe our dataset can facilitate the development of AI-based algorithms for GI disease detection and classification.
arXiv Detail & Related papers (2023-07-16T19:36:03Z)
- CLCLSA: Cross-omics Linked embedding with Contrastive Learning and Self Attention for multi-omics integration with incomplete multi-omics data [47.2764293508916]
Integration of heterogeneous and high-dimensional multi-omics data is becoming increasingly important in understanding genetic data.
One obstacle faced when performing multi-omics data integration is the existence of unpaired multi-omics data due to instrument sensitivity and cost.
We propose a deep learning method for multi-omics integration with incomplete data by Cross-omics Linked unified embedding with Contrastive Learning and Self Attention.
arXiv Detail & Related papers (2023-04-12T00:22:18Z)
- G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)
- Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise.
We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)