Related papers: Helix 1.0: An Open-Source Framework for Reproducible and Interpretable Machine Learning on Tabular Scientific Data

URL: http://arxiv.org/abs/2507.17791v1
Date: Wed, 23 Jul 2025 10:33:35 GMT
Title: Helix 1.0: An Open-Source Framework for Reproducible and Interpretable Machine Learning on Tabular Scientific Data
Authors: Eduardo Aguilar-Bejarano, Daniel Lea, Karthikeyan Sivakumar, Jimiama M. Mase, Reza Omidvar, Ruizhe Li, Troy Kettle, James Mitchell-White, Morgan R Alexander, David A Winkler, Grazziela Figueredo
Abstract summary: Helix is an open-source, Python-based software framework to facilitate reproducible and interpretable machine learning. It addresses the growing need for transparent experimental data analytics. Released under the MIT licence, Helix is accessible via GitHub and PyPI.
Score: 1.433481719062383
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Helix is an open-source, extensible, Python-based software framework to facilitate reproducible and interpretable machine learning workflows for tabular data. It addresses the growing need for transparent experimental data analytics provenance, ensuring that the entire analytical process -- including decisions around data transformation and methodological choices -- is documented, accessible, reproducible, and comprehensible to relevant stakeholders. The platform comprises modules for standardised data preprocessing, visualisation, machine learning model training, evaluation, interpretation, results inspection, and model prediction for unseen data. To further empower researchers without formal training in data science to derive meaningful and actionable insights, Helix features a user-friendly interface that enables the design of computational experiments and the inspection of outcomes, including a novel interpretation approach to machine learning decisions using linguistic terms, all within an integrated environment. Released under the MIT licence, Helix is accessible via GitHub and PyPI, supporting community-driven development and promoting adherence to the FAIR principles.

Related papers

From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence [91.54446789584826]
Epiplexity is a formalization of information capturing what computationally bounded observers can learn from data. We show how information can be created with computation, how it depends on the ordering of the data, and how likelihood modeling can produce more complex programs than present in the data generating process itself.
arXiv Detail & Related papers (2026-01-06T18:04:03Z)
Bias Begins with Data: The FairGround Corpus for Robust and Reproducible Research on Algorithmic Fairness [42.93319580186729]
Machine learning (ML) systems are increasingly adopted in high-stakes decision-making domains. At the core of fair ML research are the datasets used to investigate bias and develop mitigation strategies. We present FairGround: a unified framework, data corpus, and Python package aimed at advancing reproducible research.
arXiv Detail & Related papers (2025-10-25T16:48:33Z)
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z)
A Visual Tool for Interactive Model Explanation using Sensitivity Analysis [0.0]
We present SAInT, a Python-based tool for exploring and understanding the behavior of Machine Learning (ML) models. Our system supports Human-in-the-Loop attribution (HITL) by enabling users to configure, train, evaluate, and explain models. We demonstrate the system on a classification task predicting survival on the Titanic dataset and show how sensitivity information can guide feature selection and data refinement.
arXiv Detail & Related papers (2025-08-06T09:53:31Z)
From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience [1.136688282190268]
Reproducibility remains a central challenge in machine learning (ML). Current ML workflows are often fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This paper introduces a data-centric framework for lifecycle-aware artifacts.
arXiv Detail & Related papers (2025-06-19T06:09:01Z)
A new framework for X-ray absorption spectroscopy data analysis based on machine learning: XASDAML [3.26781102547109]
XASDAML is a flexible, machine learning based framework that integrates the entire data-processing workflow. It supports comprehensive statistical analysis, leveraging methods such as principal component analysis and clustering. The versatility and effectiveness of XASDAML are exemplified by its application to a copper dataset.
arXiv Detail & Related papers (2025-02-23T17:50:04Z)
Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
Medical artificial intelligence toolbox (MAIT): an explainable machine learning framework for binary classification, survival modelling, and regression analyses [0.0]
Medical Artificial Intelligence Toolbox (MAIT) is an explainable, open-source Python pipeline for developing and evaluating binary classification, regression, and survival models. MAIT addresses key challenges (e.g., high dimensionality, class imbalance, mixed variable types, and missingness) while promoting transparency in reporting. We provide detailed tutorials on GitHub, using four open-access data sets, to demonstrate how MAIT can be used to improve implementation and interpretation of ML models in medical research.
arXiv Detail & Related papers (2025-01-08T14:51:36Z)
Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems [0.0]
Synthetic datasets are important for evaluating and testing machine learning models. We develop a novel framework for generating synthetic datasets that are diverse and statistically coherent. The framework is available as a free open Python package to facilitate research with minimal friction.
arXiv Detail & Related papers (2024-11-27T09:53:14Z)
KAXAI: An Integrated Environment for Knowledge Analysis and Explainable AI [0.0]
The paper describes the design of a system that integrates AutoML, XAI, and synthetic data generation. The system allows users to navigate and harness the power of machine learning while abstracting its complexities and providing high usability.
arXiv Detail & Related papers (2023-12-30T10:20:47Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
Meta-learning using privileged information for dynamics [66.32254395574994]
We extend the Neural ODE Process model to use additional information within the Learning Using Privileged Information setting. We validate our extension with experiments showing improved accuracy and calibration on simulated dynamics tasks.
arXiv Detail & Related papers (2021-04-29T12:18:02Z)
PyHealth: A Python Library for Health Predictive Models [53.848478115284195]
PyHealth is an open-source Python toolbox for developing various predictive models on healthcare data. The data preprocessing module enables the transformation of complex healthcare datasets into machine learning friendly formats. The predictive modeling module provides more than 30 machine learning models, including established ensemble trees and deep neural network-based approaches.
arXiv Detail & Related papers (2021-01-11T22:02:08Z)
Reprogramming Language Models for Molecular Representation Learning [65.00999660425731]
We propose Representation Reprogramming via Dictionary Learning (R2DL) for adversarially reprogramming pretrained language models for molecular learning tasks. The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver. R2DL achieves the baseline established by state of the art toxicity prediction models trained on domain-specific data and outperforms the baseline in a limited training-data setting.
arXiv Detail & Related papers (2020-12-07T05:50:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.