Related papers: A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments

A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments

URL: http://arxiv.org/abs/2008.12829v2
Date: Tue, 8 Sep 2020 20:31:35 GMT
Title: A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments
Authors: Ryan J. Urbanowicz and Pranshu Suri and Yuhan Cui and Jason H. Moore and Karen Ruth and Rachael Stolzenberg-Solomon and Shannon M. Lynch
Abstract summary: We have laid out and assembled a complete, rigorous ML analysis pipeline focused on binary classification. This 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms. We apply this pipeline to an epidemiological investigation of established and newly identified risk factors for cancer to evaluate how different sources of bias might be handled by ML algorithms.
Score: 2.9726886415710276
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Machine learning (ML) offers a collection of powerful approaches for detecting and modeling associations, often applied to data having a large number of features and/or complex associations. Currently, there are many tools to facilitate implementing custom ML analyses (e.g. scikit-learn). Interest is also increasing in automated ML packages, which can make it easier for non-experts to apply ML and have the potential to improve model performance. ML permeates most subfields of biomedical research with varying levels of rigor and correct usage. Tremendous opportunities offered by ML are frequently offset by the challenge of assembling comprehensive analysis pipelines, and the ease of ML misuse. In this work we have laid out and assembled a complete, rigorous ML analysis pipeline focused on binary classification (i.e. case/control prediction), and applied this pipeline to both simulated and real world data. At a high level, this 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms, each with hyperparameter optimization, and e) thorough evaluation, including appropriate metrics, statistical analyses, and novel visualizations. This pipeline organizes the many subtle complexities of ML pipeline assembly to illustrate best practices to avoid bias and ensure reproducibility. Additionally, this pipeline is the first to compare established ML algorithms to 'ExSTraCS', a rule-based ML algorithm with the unique capability of interpretably modeling heterogeneous patterns of association. While designed to be widely applicable we apply this pipeline to an epidemiological investigation of established and newly identified risk factors for pancreatic cancer to evaluate how different sources of bias might be handled by ML algorithms.

Related papers

Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models.<n>We identify the presence of activation outliers, characterized by abnormally large activation values.<n>We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
arXiv Detail & Related papers (2025-08-20T17:59:51Z)
Notes on Applicability of Explainable AI Methods to Machine Learning Models Using Features Extracted by Persistent Homology [0.0]
Persistent homology (PH) has found wide-ranging applications in machine learning. The ability to achieve satisfactory levels of accuracy with relatively simple downstream machine learning models, when processing these extracted features, underlines the pipeline's superior interpretability. We explore the potential application of explainable AI methodologies to this PH-ML pipeline.
arXiv Detail & Related papers (2023-10-15T08:56:15Z)
Closing the loop: Autonomous experiments enabled by machine-learning-based online data analysis in synchrotron beamline environments [80.49514665620008]
Machine learning can be used to enhance research involving large or rapidly generated datasets. In this study, we describe the incorporation of ML into a closed-loop workflow for X-ray reflectometry (XRR) We present solutions that provide an elementary data analysis in real time during the experiment without introducing the additional software dependencies in the beamline control software environment.
arXiv Detail & Related papers (2023-06-20T21:21:19Z)
Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX. textttMEX integrates estimation and planning components while balancing exploration exploitation automatically. It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison [0.49034553215430216]
STREAMLINE is a simple, transparent, end-to-end AutoML pipeline. It is specifically designed to compare performance between datasets, ML algorithms, and other AutoML tools.
arXiv Detail & Related papers (2022-06-23T22:40:58Z)
Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines [27.461398584509755]
DataScope is the first system that efficiently computes Shapley values of training examples over an end-to-end machine learning pipeline. Our results show that DataScope is up to four orders of magnitude faster than state-of-the-art Monte Carlo-based methods.
arXiv Detail & Related papers (2022-04-23T19:29:23Z)
Adaptive neighborhood Metric learning [184.95321334661898]
We propose a novel distance metric learning algorithm, named adaptive neighborhood metric learning (ANML) ANML can be used to learn both the linear and deep embeddings. The emphlog-exp mean function proposed in our method gives a new perspective to review the deep metric learning methods.
arXiv Detail & Related papers (2022-01-20T17:26:37Z)
Exploring Opportunistic Meta-knowledge to Reduce Search Spaces for Automated Machine Learning [8.325359814939517]
This paper investigates whether, based on previous experience, a pool of available classifiers/regressors can be preemptively culled ahead of initiating a pipeline composition/optimisation process.
arXiv Detail & Related papers (2021-05-01T15:25:30Z)
LCS-DIVE: An Automated Rule-based Machine Learning Visualization Pipeline for Characterizing Complex Associations in Classification [0.7226144684379191]
This work introduces the LCS Discovery Visualization Environment (LCS-DIVE), an automated LCS interpretation pipeline for complex biomedical classification. LCS-DIVE conducts modeling using a new scikit-learn implementation of ExSTraCS, an LCS designed to overcome noise and scalability in biomedical data mining. It leverages feature-tracking scores and/or rules to automatically guide characterization of (1) feature importance (2) underlying additive, epistatic, and/or heterogeneous patterns of association, and (3) model-driven heterogeneous subgroups via clustering, visualization generation, and cluster interrogation.
arXiv Detail & Related papers (2021-04-26T19:47:03Z)
Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets. We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond [69.83813153444115]
We consider an efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference. Debiased machine learning (DML) is a data-splitting approach to estimating high-dimensional nuisances. We propose localized debiased machine learning (LDML), which avoids this burdensome step.
arXiv Detail & Related papers (2019-12-30T14:42:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.