Helix 1.0: An Open-Source Framework for Reproducible and Interpretable Machine Learning on Tabular Scientific Data
- URL: http://arxiv.org/abs/2507.17791v1
- Date: Wed, 23 Jul 2025 10:33:35 GMT
- Title: Helix 1.0: An Open-Source Framework for Reproducible and Interpretable Machine Learning on Tabular Scientific Data
- Authors: Eduardo Aguilar-Bejarano, Daniel Lea, Karthikeyan Sivakumar, Jimiama M. Mase, Reza Omidvar, Ruizhe Li, Troy Kettle, James Mitchell-White, Morgan R Alexander, David A Winkler, Grazziela Figueredo
- Abstract summary: Helix is an open-source, Python-based software framework to facilitate reproducible and interpretable machine learning. It addresses the growing need for transparent experimental data analytics. Released under the MIT licence, Helix is accessible via GitHub and PyPI.
- Score: 1.433481719062383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Helix is an open-source, extensible, Python-based software framework to facilitate reproducible and interpretable machine learning workflows for tabular data. It addresses the growing need for transparent experimental data analytics provenance, ensuring that the entire analytical process -- including decisions around data transformation and methodological choices -- is documented, accessible, reproducible, and comprehensible to relevant stakeholders. The platform comprises modules for standardised data preprocessing, visualisation, machine learning model training, evaluation, interpretation, results inspection, and model prediction for unseen data. To further empower researchers without formal training in data science to derive meaningful and actionable insights, Helix features a user-friendly interface that enables the design of computational experiments and the inspection of outcomes, including a novel approach to interpreting machine learning decisions using linguistic terms, all within an integrated environment. Released under the MIT licence, Helix is accessible via GitHub and PyPI, supporting community-driven development and promoting adherence to the FAIR principles.
Related papers
- A Visual Tool for Interactive Model Explanation using Sensitivity Analysis [0.0]
We present SAInT, a Python-based tool for exploring and understanding the behavior of Machine Learning (ML) models. Our system supports Human-in-the-Loop attribution (HITL) by enabling users to configure, train, evaluate, and explain models. We demonstrate the system on a classification task predicting survival on the Titanic dataset and show how sensitivity information can guide feature selection and data refinement.
arXiv Detail & Related papers (2025-08-06T09:53:31Z)
- From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience [1.136688282190268]
Reproducibility remains a central challenge in machine learning (ML). Current ML workflows are often fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This paper introduces a data-centric framework for lifecycle-aware artifacts.
arXiv Detail & Related papers (2025-06-19T06:09:01Z)
- A new framework for X-ray absorption spectroscopy data analysis based on machine learning: XASDAML [3.26781102547109]
XASDAML is a flexible, machine learning based framework that integrates the entire data-processing workflow. It supports comprehensive statistical analysis, leveraging methods such as principal component analysis and clustering. The versatility and effectiveness of XASDAML are exemplified by its application to a copper dataset.
arXiv Detail & Related papers (2025-02-23T17:50:04Z)
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
- Medical artificial intelligence toolbox (MAIT): an explainable machine learning framework for binary classification, survival modelling, and regression analyses [0.0]
Medical Artificial Intelligence Toolbox (MAIT) is an explainable, open-source Python pipeline for developing and evaluating binary classification, regression, and survival models. MAIT addresses key challenges (e.g., high dimensionality, class imbalance, mixed variable types, and missingness) while promoting transparency in reporting. We provide detailed tutorials on GitHub, using four open-access data sets, to demonstrate how MAIT can be used to improve implementation and interpretation of ML models in medical research.
arXiv Detail & Related papers (2025-01-08T14:51:36Z)
- Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems [0.0]
Synthetic datasets are important for evaluating and testing machine learning models. We develop a novel framework for generating synthetic datasets that are diverse and statistically coherent. The framework is available as a free open Python package to facilitate research with minimal friction.
arXiv Detail & Related papers (2024-11-27T09:53:14Z)
- KAXAI: An Integrated Environment for Knowledge Analysis and Explainable AI [0.0]
The paper describes the design of a system that integrates AutoML, XAI, and synthetic data generation.
The system allows users to navigate and harness the power of machine learning while abstracting its complexities and providing high usability.
arXiv Detail & Related papers (2023-12-30T10:20:47Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Meta-learning using privileged information for dynamics [66.32254395574994]
We extend the Neural ODE Process model to use additional information within the Learning Using Privileged Information setting.
We validate our extension with experiments showing improved accuracy and calibration on simulated dynamics tasks.
arXiv Detail & Related papers (2021-04-29T12:18:02Z)
- PyHealth: A Python Library for Health Predictive Models [53.848478115284195]
PyHealth is an open-source Python toolbox for developing various predictive models on healthcare data.
The data preprocessing module enables the transformation of complex healthcare datasets into machine learning friendly formats.
The predictive modeling module provides more than 30 machine learning models, including established ensemble trees and deep neural network-based approaches.
arXiv Detail & Related papers (2021-01-11T22:02:08Z)
- Reprogramming Language Models for Molecular Representation Learning [65.00999660425731]
We propose Representation Reprogramming via Dictionary Learning (R2DL) for adversarially reprogramming pretrained language models for molecular learning tasks.
The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver.
R2DL achieves the baseline established by state of the art toxicity prediction models trained on domain-specific data and outperforms the baseline in a limited training-data setting.
arXiv Detail & Related papers (2020-12-07T05:50:27Z)