A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary
Classification: Application in Pancreatic Cancer Nested Case-control Studies
with Implications for Bias Assessments
- URL: http://arxiv.org/abs/2008.12829v2
- Date: Tue, 8 Sep 2020 20:31:35 GMT
- Authors: Ryan J. Urbanowicz and Pranshu Suri and Yuhan Cui and Jason H. Moore
and Karen Ruth and Rachael Stolzenberg-Solomon and Shannon M. Lynch
- Abstract summary: We have laid out and assembled a complete, rigorous ML analysis pipeline focused on binary classification.
This 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, and d) model training with 9 established ML algorithms.
We apply this pipeline to an epidemiological investigation of established and newly identified risk factors for pancreatic cancer to evaluate how different sources of bias might be handled by ML algorithms.
- Score: 2.9726886415710276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) offers a collection of powerful approaches for
detecting and modeling associations, often applied to data having a large
number of features and/or complex associations. Currently, there are many tools
to facilitate implementing custom ML analyses (e.g. scikit-learn). Interest is
also increasing in automated ML packages, which can make it easier for
non-experts to apply ML and have the potential to improve model performance. ML
permeates most subfields of biomedical research with varying levels of rigor
and correct usage. Tremendous opportunities offered by ML are frequently offset
by the challenge of assembling comprehensive analysis pipelines, and the ease
of ML misuse. In this work we have laid out and assembled a complete, rigorous
ML analysis pipeline focused on binary classification (i.e. case/control
prediction), and applied this pipeline to both simulated and real world data.
At a high level, this 'automated' but customizable pipeline includes a)
exploratory analysis, b) data cleaning and transformation, c) feature
selection, d) model training with 9 established ML algorithms, each with
hyperparameter optimization, and e) thorough evaluation, including appropriate
metrics, statistical analyses, and novel visualizations. This pipeline
organizes the many subtle complexities of ML pipeline assembly to illustrate
best practices to avoid bias and ensure reproducibility. Additionally, this
pipeline is the first to compare established ML algorithms to 'ExSTraCS', a
rule-based ML algorithm with the unique capability of interpretably modeling
heterogeneous patterns of association. While designed to be widely applicable,
we apply this pipeline to an epidemiological investigation of established and
newly identified risk factors for pancreatic cancer to evaluate how different
sources of bias might be handled by ML algorithms.
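The pipeline stages listed in the abstract (transformation, feature selection, model training with hyperparameter optimization, and evaluation) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, substitutes a simple nearest-centroid classifier for the nine algorithms the paper evaluates, runs on simulated data, and omits exploratory analysis and statistical testing. All function names (simulate, stratified_folds, fit, predict, nested_cv) and parameter values are hypothetical and chosen for illustration. The point it tries to show is one of the practices the abstract alludes to: scaling, feature selection, and hyperparameter choice are fitted on training partitions only, so the held-out fold does not leak into the model.

```python
import random
import statistics

def simulate(n=200, n_features=8, n_informative=2, shift=2.0, seed=0):
    """Simulated case/control data; only the first n_informative features carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # balanced classes: 0 = control, 1 = case
        X.append([rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
                  for j in range(n_features)])
        y.append(label)
    return X, y

def stratified_folds(y, k, seed=0):
    """Split indices into k folds while preserving the case/control ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def take(seq, idx):
    return [seq[i] for i in idx]

def fit(X, y, n_keep):
    """Fit scaling, univariate feature selection, and class centroids on training data."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    Z = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]
    centroids = {}
    for label in (0, 1):
        rows = [z for z, v in zip(Z, y) if v == label]
        centroids[label] = [statistics.fmean(c) for c in zip(*rows)]
    # Rank features by the absolute gap between standardized class means.
    gaps = [abs(a - b) for a, b in zip(centroids[0], centroids[1])]
    keep = sorted(range(len(gaps)), key=lambda j: -gaps[j])[:n_keep]
    return means, sds, keep, centroids

def predict(model, X):
    """Assign each row to the nearest class centroid over the selected features."""
    means, sds, keep, centroids = model
    out = []
    for row in X:
        z = [(row[j] - means[j]) / sds[j] for j in keep]
        dist = {label: sum((a - c[j]) ** 2 for a, j in zip(z, keep))
                for label, c in centroids.items()}
        out.append(min(dist, key=dist.get))
    return out

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is robust to class imbalance."""
    recalls = []
    for label in (0, 1):
        hits = [p == t for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(sum(hits) / len(hits))
    return sum(recalls) / 2

def cv_scores(X, y, n_keep, k, seed):
    """Balanced accuracy on each of k stratified folds for a fixed n_keep."""
    folds = stratified_folds(y, k, seed)
    scores = []
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        model = fit(take(X, train), take(y, train), n_keep)
        preds = predict(model, take(X, folds[f]))
        scores.append(balanced_accuracy(take(y, folds[f]), preds))
    return scores

def nested_cv(X, y, grid=(1, 2, 4, 8), outer_k=5, inner_k=3, seed=0):
    """Outer folds estimate performance; inner folds choose n_keep on training data only."""
    folds = stratified_folds(y, outer_k, seed)
    results = []
    for f in range(outer_k):
        train = [i for g in range(outer_k) if g != f for i in folds[g]]
        X_tr, y_tr = take(X, train), take(y, train)
        best = max(grid, key=lambda n: statistics.fmean(
            cv_scores(X_tr, y_tr, n, inner_k, seed)))
        model = fit(X_tr, y_tr, best)
        preds = predict(model, take(X, folds[f]))
        results.append((best, balanced_accuracy(take(y, folds[f]), preds)))
    return results

X, y = simulate()
results = nested_cv(X, y)
for fold, (n_keep, score) in enumerate(results, start=1):
    print(f"fold {fold}: kept {n_keep} feature(s), balanced accuracy {score:.3f}")
```

A real analysis would replace the nearest-centroid model with established learners (the abstract names scikit-learn as one tool for this) and would add the exploratory analysis, statistical comparisons, and visualizations that the paper describes.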
Related papers
- Notes on Applicability of Explainable AI Methods to Machine Learning
Models Using Features Extracted by Persistent Homology [0.0]
Persistent homology (PH) has found wide-ranging applications in machine learning.
The ability to achieve satisfactory levels of accuracy with relatively simple downstream machine learning models, when processing these extracted features, underlines the pipeline's superior interpretability.
We explore the potential application of explainable AI methodologies to this PH-ML pipeline.
arXiv Detail & Related papers (2023-10-15T08:56:15Z)
- Closing the loop: Autonomous experiments enabled by
machine-learning-based online data analysis in synchrotron beamline
environments [80.49514665620008]
Machine learning can be used to enhance research involving large or rapidly generated datasets.
In this study, we describe the incorporation of ML into a closed-loop workflow for X-ray reflectometry (XRR).
We present solutions that provide an elementary data analysis in real time during the experiment without introducing additional software dependencies in the beamline control software environment.
arXiv Detail & Related papers (2023-06-20T21:21:19Z)
- Maximize to Explore: One Objective Function Fusing Estimation, Planning,
and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning
Pipeline Facilitating Data Analysis and Algorithm Comparison [0.49034553215430216]
STREAMLINE is a simple, transparent, end-to-end AutoML pipeline.
It is specifically designed to compare performance between datasets, ML algorithms, and other AutoML tools.
arXiv Detail & Related papers (2022-06-23T22:40:58Z)
- Data Debugging with Shapley Importance over End-to-End Machine Learning
Pipelines [27.461398584509755]
DataScope is the first system that efficiently computes Shapley values of training examples over an end-to-end machine learning pipeline.
Our results show that DataScope is up to four orders of magnitude faster than state-of-the-art Monte Carlo-based methods.
arXiv Detail & Related papers (2022-04-23T19:29:23Z)
- Adaptive neighborhood Metric learning [184.95321334661898]
We propose a novel distance metric learning algorithm, named adaptive neighborhood metric learning (ANML).
ANML can be used to learn both linear and deep embeddings.
The log-exp mean function proposed in our method gives a new perspective from which to review deep metric learning methods.
arXiv Detail & Related papers (2022-01-20T17:26:37Z)
- Exploring Opportunistic Meta-knowledge to Reduce Search Spaces for
Automated Machine Learning [8.325359814939517]
This paper investigates whether, based on previous experience, a pool of available classifiers/regressors can be preemptively culled ahead of initiating a pipeline composition/optimisation process.
arXiv Detail & Related papers (2021-05-01T15:25:30Z)
- LCS-DIVE: An Automated Rule-based Machine Learning Visualization
Pipeline for Characterizing Complex Associations in Classification [0.7226144684379191]
This work introduces the LCS Discovery Visualization Environment (LCS-DIVE), an automated LCS interpretation pipeline for complex biomedical classification.
LCS-DIVE conducts modeling using a new scikit-learn implementation of ExSTraCS, an LCS designed to overcome noise and scalability challenges in biomedical data mining.
It leverages feature-tracking scores and/or rules to automatically guide characterization of (1) feature importance, (2) underlying additive, epistatic, and/or heterogeneous patterns of association, and (3) model-driven heterogeneous subgroups via clustering, visualization generation, and cluster interrogation.
arXiv Detail & Related papers (2021-04-26T19:47:03Z)
- Generalized Matrix Factorization: efficient algorithms for fitting
generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
- Localized Debiased Machine Learning: Efficient Inference on Quantile
Treatment Effects and Beyond [69.83813153444115]
We consider an efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference.
Debiased machine learning (DML) is a data-splitting approach to estimating high-dimensional nuisances.
We propose localized debiased machine learning (LDML), which avoids this burdensome step.
arXiv Detail & Related papers (2019-12-30T14:42:52Z)