Best Practices for Machine Learning Experimentation in Scientific Applications
- URL: http://arxiv.org/abs/2511.21354v2
- Date: Thu, 27 Nov 2025 06:48:18 GMT
- Title: Best Practices for Machine Learning Experimentation in Scientific Applications
- Authors: Umberto Michelucci, Francesca Venturini
- Abstract summary: This paper presents a practical and structured guide for conducting machine learning experiments in scientific applications. We outline a step-by-step workflow, from dataset preparation to model selection and evaluation. We propose metrics that account for overfitting and instability across folds, including the Logarithmic Overfitting Ratio (LOR) and the Composite Overfitting Score (COS).
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine learning (ML) is increasingly adopted in scientific research, yet the quality and reliability of results often depend on how experiments are designed and documented. Poor baselines, inconsistent preprocessing, or insufficient validation can lead to misleading conclusions about model performance. This paper presents a practical and structured guide for conducting ML experiments in scientific applications, focussing on reproducibility, fair comparison, and transparent reporting. We outline a step-by-step workflow, from dataset preparation to model selection and evaluation, and propose metrics that account for overfitting and instability across validation folds, including the Logarithmic Overfitting Ratio (LOR) and the Composite Overfitting Score (COS). Through recommended practices and example reporting formats, this work aims to support researchers in establishing robust baselines and drawing valid evidence-based insights from ML models applied to scientific problems.
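The abstract names the two proposed metrics, LOR and COS, but does not reproduce their formulas here. The sketch below is one hypothetical reading, not the authors' definitions: LOR is taken as the log ratio of validation to training error per fold, and COS as the mean LOR across folds plus its standard deviation to penalize fold-to-fold instability. All function names and formulas are illustrative assumptions.

```python
import math
import statistics

def log_overfitting_ratio(train_err, val_err, eps=1e-12):
    """Log-scale gap between validation and training error (assumed form).

    Zero when the two errors match; positive when validation error
    exceeds training error, i.e. the model overfits.
    """
    return math.log((val_err + eps) / (train_err + eps))

def composite_overfitting_score(train_errs, val_errs):
    """Combine mean per-fold LOR with its spread across folds (assumed form)."""
    lors = [log_overfitting_ratio(t, v) for t, v in zip(train_errs, val_errs)]
    return statistics.mean(lors) + statistics.stdev(lors)

# Example: 5-fold results where validation error consistently exceeds
# training error, signalling overfitting plus some instability.
train = [0.05, 0.06, 0.05, 0.07, 0.06]
val   = [0.12, 0.15, 0.11, 0.18, 0.13]
print(composite_overfitting_score(train, val))
```

Under this reading, a model that generalizes perfectly and is stable across folds would score near zero, while either a large train/validation gap or high variance between folds pushes the score up.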
Related papers
- Evaluating Large Language Models in Scientific Discovery
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance.
arXiv Detail & Related papers (2025-12-17T16:20:03Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric
Large language models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models
Large language models (LLMs) have significantly advanced fact-checking research. Existing automated fact-checking evaluation methods rely on static datasets and classification metrics. We introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities.
arXiv Detail & Related papers (2025-02-25T07:44:22Z) - Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning
We propose an approach that transforms existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z) - Stronger Baseline Models -- A Key Requirement for Aligning Machine Learning Research with Clinical Utility
Well-known barriers exist when attempting to deploy Machine Learning models in high-stakes, clinical settings.
We show empirically that including stronger baseline models in evaluations has important downstream effects.
We propose some best practices that will enable practitioners to more effectively study and deploy ML models in clinical settings.
arXiv Detail & Related papers (2024-09-18T16:38:37Z) - Unraveling overoptimism and publication bias in ML-driven science
Recent studies suggest that the published performance of machine learning models is often overoptimistic.
We introduce a novel model for observed accuracy, integrating parametric learning curves and the aforementioned biases.
Applying the model to meta-analyses of classifications of neurological conditions, we estimate the inherent limits of ML-based prediction in each domain.
arXiv Detail & Related papers (2024-05-23T10:43:20Z) - An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models
Supervised finetuning on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities of large language models.
Active learning is effective in identifying useful subsets of samples to annotate from an unlabeled pool.
We propose using experimental design to circumvent the computational bottlenecks of active learning.
arXiv Detail & Related papers (2024-01-12T16:56:54Z) - mlscorecheck: Testing the consistency of reported performance scores and experiments in machine learning
We have developed numerical techniques capable of identifying inconsistencies between reported performance scores and various experimental setups in machine learning problems.
These consistency tests are integrated into the open-source package mlscorecheck.
arXiv Detail & Related papers (2023-11-13T18:31:48Z) - Evaluating the Effectiveness of Retrieval-Augmented Large Language Models in Scientific Document Reasoning
Large language models (LLMs) often provide seemingly plausible but non-factual information, commonly referred to as hallucinations.
Retrieval-augmented LLMs provide a non-parametric approach to solve these issues by retrieving relevant information from external data sources.
We critically evaluate these models in their ability to perform in scientific document reasoning tasks.
arXiv Detail & Related papers (2023-11-07T21:09:57Z) - Benchopt: Reproducible, efficient and collaborative optimization benchmarks
Benchopt is a framework to automate, reproduce and publish optimization benchmarks in machine learning.
Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments.
arXiv Detail & Related papers (2022-06-27T16:19:24Z) - On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods
We present an experimental study extending a prior explainable ML evaluation experiment and bringing the setup closer to the deployment setting.
Our empirical study draws dramatically different conclusions than the prior work, highlighting how seemingly trivial experimental design choices can yield misleading results.
We believe this work holds lessons about the necessity of situating the evaluation of any ML method and choosing appropriate tasks, data, users, and metrics to match the intended deployment contexts.
arXiv Detail & Related papers (2022-06-24T14:46:19Z) - FairIF: Boosting Fairness in Deep Learning via Influence Functions with Validation Set Sensitive Attributes
We propose a two-stage training algorithm named FAIRIF.
It minimizes the loss over a reweighted dataset, with sample weights computed via influence functions.
We show that FAIRIF yields models with better fairness-utility trade-offs against various types of bias.
arXiv Detail & Related papers (2022-01-15T05:14:48Z)
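The mlscorecheck entry above rests on a simple numerical principle that can be sketched without the package itself (this is not its actual API): a reported accuracy, rounded to d decimal places, is only consistent with a test set of n samples if some integer count of correct predictions k satisfies round(k / n, d) == reported value.

```python
def accuracy_consistent(reported, n, decimals):
    """Check whether a reported accuracy is achievable on n test samples.

    Accuracy must equal k / n for some integer k in [0, n]; if no such
    k rounds to the reported value, the report is internally inconsistent.
    """
    target = round(reported, decimals)
    return any(round(k / n, decimals) == target for k in range(n + 1))

# 0.85 is reachable as 85/100, so it is consistent with n = 100.
print(accuracy_consistent(0.85, 100, 2))   # True
# No k/100 rounds to 0.857 at three decimals, so this report is suspect.
print(accuracy_consistent(0.857, 100, 3))  # False
```

The real package extends this idea to many scores (sensitivity, specificity, F1, ...) and to aggregated cross-validation setups, but the exhaustive-search core is the same.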
This list is automatically generated from the titles and abstracts of the papers in this site.