The NCI Imaging Data Commons as a platform for reproducible research in
computational pathology
- URL: http://arxiv.org/abs/2303.09354v3
- Date: Tue, 7 Nov 2023 14:26:11 GMT
- Title: The NCI Imaging Data Commons as a platform for reproducible research in
computational pathology
- Authors: Daniela P. Schacherer, Markus D. Herrmann, David A. Clunie, Henning Höfener, William Clifford, William J.R. Longabaugh, Steve Pieper, Ron Kikinis, Andrey Fedorov, André Homeyer
- Abstract summary: Reproducibility is a major challenge in developing machine learning (ML)-based solutions in computational pathology (CompPath).
The NCI Imaging Data Commons (IDC) provides >120 cancer image collections according to the FAIR principles and is designed to be used with cloud ML services.
We implement two experiments in which a representative ML-based method for classifying lung tumor tissue is trained and/or evaluated on different datasets.
- Score: 0.0773931605896092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background and Objectives: Reproducibility is a major challenge in developing
machine learning (ML)-based solutions in computational pathology (CompPath).
The NCI Imaging Data Commons (IDC) provides >120 cancer image collections
according to the FAIR principles and is designed to be used with cloud ML
services. Here, we explore its potential to facilitate reproducibility in
CompPath research.
Methods: Using the IDC, we implemented two experiments in which a
representative ML-based method for classifying lung tumor tissue was trained
and/or evaluated on different datasets. To assess reproducibility, the
experiments were run multiple times with separate but identically configured
instances of common ML services.
Results: The AUC values of different runs of the same experiment were
generally consistent. However, we observed small variations in AUC values of up
to 0.045, indicating a practical limit to reproducibility.
Conclusions: We conclude that the IDC facilitates approaching the
reproducibility limit of CompPath research (i) by enabling researchers to reuse
exactly the same datasets and (ii) by integrating with cloud ML services so
that experiments can be run in identically configured computing environments.
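As a rough illustration of the reproducibility check described above, the sketch below repeats a (simulated) evaluation several times and measures the spread of the resulting AUC values. Here, evaluate_run is a hypothetical stand-in for one training/evaluation run on a freshly provisioned cloud instance, not the paper's actual pipeline.

```python
# Minimal sketch: quantify run-to-run AUC variation for a binary
# tissue classifier evaluated repeatedly on the same held-out data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)  # held-out labels

def evaluate_run(seed: int) -> float:
    """One independent run: simulated here by adding small noise to a
    fixed score vector, mimicking non-deterministic training."""
    run_rng = np.random.default_rng(seed)
    scores = y_true + run_rng.normal(0, 0.8, size=y_true.size)
    return roc_auc_score(y_true, scores)

aucs = [evaluate_run(seed) for seed in range(5)]
print(f"AUCs: {np.round(aucs, 4)}")
# The paper reports spreads of up to 0.045 as a practical limit.
print(f"max spread: {max(aucs) - min(aucs):.4f}")
```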
Related papers
- Is Limited Participant Diversity Impeding EEG-based Machine Learning? [12.258707843214946]
It is common practice to split EEG recordings into small segments, thereby increasing the number of samples.
We conceptualise this as a multi-level data generation process and investigate the scaling behaviour of model performance.
We then use the same framework to investigate the effectiveness of different ML strategies designed to address limited data problems.
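A minimal sketch of the segmentation practice the paper examines, assuming an illustrative (channels, time) layout; the channel count, sampling rate, and 2-second window are not from the paper.

```python
# Split a long EEG recording into fixed-length segments to multiply
# the number of training samples.
import numpy as np

def segment_eeg(recording: np.ndarray, win: int, step: int) -> np.ndarray:
    """Slice a (channels, time) recording into (n_segments, channels, win)."""
    n_ch, n_t = recording.shape
    starts = range(0, n_t - win + 1, step)
    return np.stack([recording[:, s:s + win] for s in starts])

rec = np.random.randn(32, 10 * 256)                  # 32 channels, 10 s at 256 Hz
segments = segment_eeg(rec, win=2 * 256, step=256)   # 2 s windows, 1 s hop
print(segments.shape)                                # (9, 32, 512)
```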
arXiv Detail & Related papers (2025-03-11T12:04:59Z)
- MLXP: A Framework for Conducting Replicable Experiments in Python [63.37350735954699]
We propose MLXP, an open-source, simple, and lightweight experiment management tool based on Python.
It streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility.
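A hedged sketch of the general pattern such a tool supports: persisting each run's exact configuration and results so it can be inspected and replayed later. This is not MLXP's actual API, only a generic illustration.

```python
# Generic experiment logging: one directory per run, holding the exact
# config and the resulting metrics as JSON.
import json, time, pathlib

def log_run(config: dict, metrics: dict, root: str = "runs") -> pathlib.Path:
    run_dir = pathlib.Path(root) / time.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    (run_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
    return run_dir

log_run({"lr": 1e-3, "seed": 0}, {"auc": 0.94})
```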
arXiv Detail & Related papers (2024-02-21T14:22:20Z)
- Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
However, acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z)
- DCID: Deep Canonical Information Decomposition [84.59396326810085]
We consider the problem of identifying the signal shared between two one-dimensional target variables.
We propose ICM, an evaluation metric which can be used in the presence of ground-truth labels.
We also propose Deep Canonical Information Decomposition (DCID) - a simple, yet effective approach for learning the shared variables.
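As a rough analogue of that goal, the sketch below recovers a shared latent signal with classical linear CCA; this is a textbook baseline, not the authors' DCID method.

```python
# Recover a latent signal shared between two noisy multivariate views
# using classical linear canonical correlation analysis.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))  # shared latent signal
X = np.hstack([z + 0.3 * rng.normal(size=(500, 1)) for _ in range(5)])
Y = np.hstack([z + 0.3 * rng.normal(size=(500, 1)) for _ in range(5)])

cca = CCA(n_components=1).fit(X, Y)
u, v = cca.transform(X, Y)
print(f"correlation of projections: {np.corrcoef(u[:, 0], v[:, 0])[0, 1]:.3f}")
```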
arXiv Detail & Related papers (2023-06-27T16:59:06Z)
- Multi-Study R-Learner for Estimating Heterogeneous Treatment Effects Across Studies Using Statistical Machine Learning [1.1045045527359925]
Estimating heterogeneous treatment effects (HTEs) is crucial for precision medicine.
Existing approaches often assume identical HTEs across studies.
We propose a framework for multi-study HTE estimation.
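For orientation, here is a hedged sketch of the basic single-study R-learner that the multi-study framework builds on; the synthetic data, nuisance models, and in-sample fitting (cross-fitting omitted for brevity) are illustrative simplifications.

```python
# R-learner: residualize outcome and treatment on covariates, then
# regress outcome residuals on treatment residuals to estimate tau(x).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
w = rng.integers(0, 2, size=n)       # binary treatment
tau = 1.0 + X[:, 0]                  # true heterogeneous effect
y = X.sum(axis=1) + w * tau + rng.normal(size=n)

m_hat = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y).predict(X)
e_hat = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, w).predict_proba(X)[:, 1]
e_hat = np.clip(e_hat, 0.05, 0.95)   # keep treatment residuals away from zero

r_y, r_w = y - m_hat, w - e_hat
tau_model = LinearRegression().fit(X, r_y / r_w, sample_weight=r_w ** 2)
print("estimated effect of X0 on tau:", tau_model.coef_[0])  # ideally near 1.0
```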
arXiv Detail & Related papers (2023-06-01T18:56:58Z)
- Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can quickly simulate million-size populations in a few seconds on commodity hardware, integrate with deep neural networks, and ingest heterogeneous data sources.
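The idea of fitting epidemic parameters by gradient descent through a differentiable simulator can be illustrated with a tiny compartmental model; the sketch below uses a simple SIR system in PyTorch and is not GradABM's agent-based design.

```python
# Differentiable SIR model: the infection rate beta is recovered from
# an "observed" infection curve via automatic differentiation.
import torch

def sir(beta: torch.Tensor, gamma: float = 0.1, steps: int = 60) -> torch.Tensor:
    s, i = torch.tensor(0.99), torch.tensor(0.01)
    traj = []
    for _ in range(steps):
        new_inf = beta * s * i
        s, i = s - new_inf, i + new_inf - gamma * i
        traj.append(i)
    return torch.stack(traj)

target = sir(torch.tensor(0.3)).detach()   # synthetic "observed" curve
beta = torch.tensor(0.15, requires_grad=True)
opt = torch.optim.Adam([beta], lr=0.02)
for _ in range(200):
    opt.zero_grad()
    loss = ((sir(beta) - target) ** 2).mean()
    loss.backward()
    opt.step()
print(f"recovered beta: {beta.item():.3f}")  # should approach 0.3
```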
arXiv Detail & Related papers (2022-07-20T07:32:02Z)
- Enabling Reproducibility and Meta-learning Through a Lifelong Database of Experiments (LDE) [0.43012765978447565]
We present the Lifelong Database of Experiments (LDE) that automatically extracts and stores linked metadata from experiment artifacts.
We store context from multiple stages of the AI development lifecycle including datasets, pipelines, how each is configured, and training runs with information about their runtime environment.
We perform two experiments on this metadata: 1) examining the variability of the performance metrics and 2) implementing a number of meta-learning algorithms on top of the data.
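A toy sketch of the first kind of analysis: grouping stored runs by their linked metadata and inspecting metric variability. The record fields are assumptions, not LDE's actual schema.

```python
# Group runs by (dataset, pipeline) and report the spread of a metric.
from collections import defaultdict

runs = [
    {"dataset": "A", "pipeline": "p1", "auc": 0.91},
    {"dataset": "A", "pipeline": "p1", "auc": 0.93},
    {"dataset": "A", "pipeline": "p2", "auc": 0.88},
    {"dataset": "B", "pipeline": "p1", "auc": 0.84},
]
by_key = defaultdict(list)
for r in runs:
    by_key[(r["dataset"], r["pipeline"])].append(r["auc"])
for key, aucs in by_key.items():
    print(key, "spread:", round(max(aucs) - min(aucs), 3))
```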
arXiv Detail & Related papers (2022-02-22T15:35:16Z)
- Learning Robust Hierarchical Patterns of Human Brain across Many fMRI Studies [2.451910407959205]
Resting-state fMRI has been shown to provide surrogate biomarkers for the analysis of various diseases.
To improve the statistical power of biomarkers and the understanding of brain mechanisms, pooling of multi-center studies has become increasingly popular.
But pooling the data from multiple sites introduces variations due to hardware, software, and environment.
arXiv Detail & Related papers (2021-05-13T20:10:00Z)
- LCS-DIVE: An Automated Rule-based Machine Learning Visualization Pipeline for Characterizing Complex Associations in Classification [0.7226144684379191]
This work introduces the LCS Discovery Visualization Environment (LCS-DIVE), an automated LCS interpretation pipeline for complex biomedical classification.
LCS-DIVE conducts modeling using a new scikit-learn implementation of ExSTraCS, an LCS designed to overcome noise and scalability challenges in biomedical data mining.
It leverages feature-tracking scores and/or rules to automatically guide characterization of (1) feature importance, (2) underlying additive, epistatic, and/or heterogeneous patterns of association, and (3) model-driven heterogeneous subgroups via clustering, visualization generation, and cluster interrogation.
arXiv Detail & Related papers (2021-04-26T19:47:03Z)
- Continual Learning with Fully Probabilistic Models [70.3497683558609]
We present an approach for continual learning based on fully probabilistic (or generative) models of machine learning.
We propose a pseudo-rehearsal approach, Gaussian Mixture Replay (GMR), using a Gaussian Mixture Model (GMM) instance for both generator and classifier functionalities.
We show that GMR achieves state-of-the-art performance on common class-incremental learning problems at very competitive time and memory complexity.
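A minimal sketch of GMM-based pseudo-rehearsal, simplified from the paper's approach: after the first task, only a fitted mixture is kept, and synthetic samples drawn from it stand in for the original data when the next task is learned.

```python
# Pseudo-rehearsal with a Gaussian mixture as the "memory" of task 1.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X1 = rng.normal(loc=-2, size=(300, 2)); y1 = np.zeros(300)  # task-1 class
X2 = rng.normal(loc=+2, size=(300, 2)); y2 = np.ones(300)   # task-2 class

gmm = GaussianMixture(n_components=3, random_state=0).fit(X1)
X1_rehearsal, _ = gmm.sample(300)   # generated samples, real X1 not stored

clf = LogisticRegression().fit(
    np.vstack([X1_rehearsal, X2]), np.concatenate([y1, y2]))
print("accuracy on real task-1 data:", clf.score(X1, np.zeros(300)))
```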
arXiv Detail & Related papers (2021-04-19T12:26:26Z)
- Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation [15.451690870640295]
In some scenarios, such as healthcare, usually only a few records are available for each patient, impeding the application of current reinforcement learning (RL) algorithms.
We propose a data-efficient RL algorithm that exploits structural causal models (SCMs) to model the state dynamics.
We show that counterfactual outcomes are identifiable under mild conditions and that Q-learning on the counterfactual-based augmented data set converges to the optimal value function.
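A toy sketch of the augmentation idea: each logged transition is duplicated with alternative actions, with a known toy dynamics function standing in for the paper's SCM-based counterfactual outcome estimates, and tabular Q-learning is then run on the augmented set.

```python
# Tabular Q-learning on a counterfactually augmented transition set.
import numpy as np

n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.1

def toy_dynamics(s: int, a: int):
    """Stand-in for SCM-based counterfactual outcomes (illustrative only)."""
    return (s + a) % n_states, float(s == n_states - 1)

logged = [(s, 1) for s in range(n_states)]  # few real records per state
augmented = [(s, a) for s, _ in logged for a in range(n_actions)]

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    for s, a in augmented:
        s2, r = toy_dynamics(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
print(np.round(Q, 2))
```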
arXiv Detail & Related papers (2020-12-16T17:21:13Z)
- Knowledge transfer across cell lines using Hybrid Gaussian Process models with entity embedding vectors [62.997667081978825]
A large number of experiments are performed to develop a biochemical process.
If we could exploit data from already developed processes to make predictions for a novel process, we could significantly reduce the number of experiments needed.
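A hedged sketch of the pooling idea: encode the cell line as an extra input dimension so a single Gaussian process can learn from several processes at once; the one-hot code below stands in for the paper's learned entity embedding vectors.

```python
# One GP over pooled data from two "cell lines", distinguished by an
# entity code appended to the process features.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def make_data(offset: float, n: int = 30):
    x = rng.uniform(0, 5, size=(n, 1))
    return x, np.sin(x[:, 0]) + offset + 0.05 * rng.normal(size=n)

(xa, ya), (xb, yb) = make_data(0.0), make_data(0.5)
onehot = lambda k, n: np.eye(2)[np.full(n, k)]
X = np.vstack([np.hstack([xa, onehot(0, 30)]), np.hstack([xb, onehot(1, 30)])])
y = np.concatenate([ya, yb])

gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=[1.0, 1.0, 1.0]),  # anisotropic over 3 inputs
    alpha=1e-3,                                # small noise/jitter term
).fit(X, y)
print(gp.predict(np.array([[2.5, 0.0, 1.0]])))  # query cell line B at x=2.5
```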
arXiv Detail & Related papers (2020-11-27T17:38:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.