Julearn: an easy-to-use library for leakage-free evaluation and
inspection of ML models
- URL: http://arxiv.org/abs/2310.12568v1
- Date: Thu, 19 Oct 2023 08:21:12 GMT
- Title: Julearn: an easy-to-use library for leakage-free evaluation and
inspection of ML models
- Authors: Sami Hamdan, Shammi More, Leonard Sasse, Vera Komeyer, Kaustubh R.
Patil and Federico Raimondo (for the Alzheimer's Disease Neuroimaging
Initiative)
- Abstract summary: We present the rationale behind julearn's design, its core features, and showcase three examples of previously-published research projects.
Julearn aims to simplify the entry into the machine learning world by providing an easy-to-use environment with built-in guards against some of the most common ML pitfalls.
- Score: 0.23301643766310373
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The fast-paced development of machine learning (ML) methods coupled with its
increasing adoption in research poses challenges for researchers without
extensive training in ML. In neuroscience, for example, ML can help understand
brain-behavior relationships, diagnose diseases, and develop biomarkers using
various data sources like magnetic resonance imaging and
electroencephalography. The primary objective of ML is to build models that can
make accurate predictions on unseen data. Researchers aim to prove the
existence of such generalizable models by evaluating performance using
techniques such as cross-validation (CV), which uses systematic subsampling to
estimate the generalization performance. Choosing a CV scheme and evaluating an
ML pipeline can be challenging and, if used improperly, can lead to
overestimated results and incorrect interpretations.
We created julearn, an open-source Python library that allows researchers to
design and evaluate complex ML pipelines without encountering common
pitfalls. In this manuscript, we present the rationale behind julearn's design,
its core features, and showcase three examples of previously-published research
projects that can be easily implemented using this novel library. Julearn aims
to simplify the entry into the ML world by providing an easy-to-use environment
with built-in guards against some of the most common ML pitfalls. With its
design, unique features, and simple interface, it serves as a useful Python-based
library for research projects.
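To make the leakage pitfall described in the abstract concrete, the sketch below uses plain scikit-learn (not julearn's own API, which this summary does not detail) and synthetic toy data. Fitting a feature-selection step on the full dataset before cross-validation inflates the estimated accuracy well above chance, whereas refitting it inside each training fold via a pipeline does not. Variable names and parameter choices here are illustrative assumptions, not the paper's examples.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))   # pure noise features
y = rng.integers(0, 2, size=100)       # labels unrelated to the features

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky evaluation: feature selection sees the labels of the future test
# folds, so the CV estimate drifts above chance even though X is noise.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(SVC(), X_leaky, y, cv=cv)

# Leakage-free evaluation: the selector is refit on the training portion of
# every fold, keeping the estimate near the chance level of ~0.5.
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC())
clean_scores = cross_val_score(pipe, X, y, cv=cv)

print(f"leaky CV accuracy:        {leaky_scores.mean():.2f}")
print(f"leakage-free CV accuracy: {clean_scores.mean():.2f}")
```

Making the second, leakage-free pattern the default behaviour is the kind of guard against common ML pitfalls that julearn advertises.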
Related papers
- MLXP: A Framework for Conducting Replicable Experiments in Python [63.37350735954699]
We propose MLXP, an open-source, simple, and lightweight experiment management tool based on Python.
It streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility.
arXiv Detail & Related papers (2024-02-21T14:22:20Z)
- DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows [72.40917624485822]
We introduce DataDreamer, an open source Python library that allows researchers to implement powerful large language model workflows.
DataDreamer also helps researchers adhere to best practices that we propose to encourage open science.
arXiv Detail & Related papers (2024-02-16T00:10:26Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Learn to Unlearn: A Survey on Machine Unlearning [29.077334665555316]
This article presents a review of recent machine unlearning techniques, verification mechanisms, and potential attacks.
We highlight emerging challenges and prospective research directions.
We aim for this paper to provide valuable resources for integrating privacy, equity, and resilience into ML systems.
arXiv Detail & Related papers (2023-05-12T14:28:02Z)
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
- The Integration of Machine Learning into Automated Test Generation: A Systematic Mapping Study [15.016047591601094]
We characterize emerging research, examining testing practices, researcher goals, ML techniques applied, evaluation, and challenges.
ML generates input for system, GUI, unit, and performance testing, or improves the performance of existing generation methods.
arXiv Detail & Related papers (2022-06-21T09:26:25Z)
- PyRelationAL: A Library for Active Learning Research and Development [0.11545092788508224]
PyRelationAL is an open source library for active learning (AL) research.
It provides access to benchmark datasets and AL task configurations based on existing literature.
We perform experiments on the PyRelationAL collection of benchmark datasets and showcase the considerable economies that AL can provide.
arXiv Detail & Related papers (2022-05-23T08:21:21Z)
- What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks? [59.51457877578138]
We study contrastive learning on the wearable-based activity recognition task.
This paper presents an open-source PyTorch library, CL-HAR, which can serve as a practical tool for researchers.
arXiv Detail & Related papers (2022-02-12T06:10:15Z)
- A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments [2.9726886415710276]
We have laid out and assembled a complete, rigorous ML analysis pipeline focused on binary classification.
This 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms.
We apply this pipeline to an epidemiological investigation of established and newly identified risk factors for cancer to evaluate how different sources of bias might be handled by ML algorithms.
arXiv Detail & Related papers (2020-08-28T19:58:05Z)
- Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles [0.0]
We describe our goals and initial steps in supporting the end-to-end of machine learning pipelines.
We investigate which factors beyond the availability of source code and datasets influence the reproducibility of ML experiments.
We propose ways to apply FAIR data practices to ML experiments.
arXiv Detail & Related papers (2020-06-22T10:17:34Z)
- Bayesian active learning for production, a systematic study and a reusable library [85.32971950095742]
In this paper, we analyse the main drawbacks of current active learning techniques.
We do a systematic study on the effects of the most common issues of real-world datasets on the deep active learning process.
We derive two techniques that can speed up the active learning loop: partial uncertainty sampling and a larger query size.
arXiv Detail & Related papers (2020-06-17T14:51:11Z)