BenchMake: Turn any scientific data set into a reproducible benchmark
- URL: http://arxiv.org/abs/2506.23419v1
- Date: Sun, 29 Jun 2025 22:56:48 GMT
- Title: BenchMake: Turn any scientific data set into a reproducible benchmark
- Authors: Amanda S. Barnard
- Abstract summary: The relative rarity of benchmark sets in computational science makes evaluating new innovations difficult. A new tool is developed to potentially turn any of the increasing numbers of scientific data sets made openly available into a benchmark accessible to the community.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benchmark data sets are a cornerstone of machine learning development and applications, ensuring new methods are robust, reliable and competitive. The relative rarity of benchmark sets in computational science, due to the uniqueness of the problems and the pace of change in the associated domains, makes evaluating new innovations difficult for computational scientists. In this paper a new tool is developed and tested to potentially turn any of the increasing numbers of scientific data sets made openly available into a benchmark accessible to the community. BenchMake uses non-negative matrix factorisation to deterministically identify and isolate challenging edge cases on the convex hull (the smallest convex set that contains all existing data instances) and partitions a required fraction of matched data instances into a testing set that maximises divergence and statistical significance, across tabular, graph, image, signal and textual modalities. BenchMake splits are compared to established splits and random splits using ten publicly available benchmark sets from different areas of science, with different sizes, shapes, and distributions.
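The abstract suggests a simple recipe that can be sketched directly: factorise the (non-negative) data with NMF under a deterministic initialisation, treat instances lying closest to the learned archetypes as convex-hull edge cases, and route a fixed fraction of them to the test set. The sketch below is a minimal illustration under those assumptions; `edge_case_split` and its parameters are hypothetical, not the actual BenchMake API.

```python
import numpy as np
from sklearn.decomposition import NMF

def edge_case_split(X, test_fraction=0.2, n_components=8):
    """Partition rows of a non-negative matrix X into train/test indices,
    sending the instances nearest the NMF archetypes (used here as proxies
    for convex-hull edge cases) to the test set."""
    model = NMF(n_components=n_components, init="nndsvd", max_iter=500)
    model.fit(X)                  # "nndsvd" init makes the fit deterministic
    H = model.components_         # archetypes (basis vectors in feature space)

    # Distance from each instance to its nearest archetype; small values mark
    # instances sitting near the extremes of the data.
    d = np.linalg.norm(X[:, None, :] - H[None, :, :], axis=2).min(axis=1)

    n_test = int(round(test_fraction * X.shape[0]))
    order = np.argsort(d)         # closest to an archetype first
    return np.sort(order[n_test:]), np.sort(order[:n_test])  # train, test

# Usage: features must be non-negative, e.g. min-max scaled beforehand.
X = np.random.default_rng(0).random((200, 16))
train_idx, test_idx = edge_case_split(X, test_fraction=0.2)
```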
Related papers
- On the Interconnections of Calibration, Quantification, and Classifier Accuracy Prediction under Dataset Shift [58.91436551466064]
This paper investigates the interconnections among three fundamental problems, calibration, quantification, and classifier accuracy prediction, under dataset shift conditions. We show that access to an oracle for any one of these tasks enables the resolution of the other two. We propose new methods for each problem based on direct adaptations of well-established methods borrowed from the other disciplines.
arXiv Detail & Related papers (2025-05-16T15:42:55Z)
- ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities [30.123976500620834]
Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models. We propose ONEBench, a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets.
arXiv Detail & Related papers (2024-12-09T18:37:14Z)
- Benchmarking Video Frame Interpolation [11.918489436283748]
We present a benchmark that establishes consistent error metrics by means of a submission website that computes them.
We also present a test set adhering to the assumption of linearity by utilizing synthetic data, and evaluate the computational efficiency in a coherent manner.
arXiv Detail & Related papers (2024-03-25T19:13:12Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
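Read literally, the mixing step resembles mixup-style oversampling: each synthetic minority sample is a convex combination of a minority instance and a (possibly majority-class) neighbour. Below is a minimal sketch under that reading; the function name and the interpolation schedule are illustrative, not the paper's exact procedure.

```python
import numpy as np

def mix_oversample(X_min, X_maj, n_new, low=0.75, seed=0):
    """Generate n_new synthetic minority samples as convex mixes of a random
    minority instance and a random majority instance; keeping the mixing
    weight above `low` biases each mix toward the minority class."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_maj), size=n_new)
    lam = rng.uniform(low, 1.0, size=(n_new, 1))   # per-sample mixing weight
    return lam * X_min[i] + (1.0 - lam) * X_maj[j]

# Usage: top up the minority class until the two classes are balanced.
X_min, X_maj = np.random.rand(30, 5), np.random.rand(300, 5)
X_new = mix_oversample(X_min, X_maj, n_new=len(X_maj) - len(X_min))
```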
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Heterogeneous Datasets for Federated Survival Analysis Simulation [6.489759672413373]
This work proposes a novel technique for constructing realistic heterogeneous datasets by starting from existing non-federated datasets in a reproducible way.
Specifically, we provide two novel dataset-splitting algorithms based on the Dirichlet distribution to assign each data sample to a carefully chosen client.
The implementation of the proposed methods is publicly available, to encourage common practices for simulating federated environments for survival analysis.
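The widely used form of a Dirichlet split (which may differ in detail from the paper's two algorithms) draws, for each class, a vector of client shares from Dirichlet(alpha) and partitions that class's samples accordingly; smaller alpha yields more heterogeneous clients. A minimal sketch:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha=0.5, seed=0):
    """Assign sample indices to clients. For each class, client shares are
    drawn from Dirichlet(alpha); small alpha concentrates a class on few
    clients, producing heterogeneous (non-IID) partitions."""
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        shares = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

# Usage: three heterogeneous clients from a toy labelled dataset.
labels = np.random.default_rng(1).integers(0, 2, size=100)
parts = dirichlet_partition(labels, n_clients=3, alpha=0.3)
```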
arXiv Detail & Related papers (2023-01-28T11:37:07Z)
- Integrating Transformer and Autoencoder Techniques with Spectral Graph Algorithms for the Prediction of Scarcely Labeled Molecular Data [2.8360662552057323]
This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge.
Specifically, graph-based modifications of the MBO scheme are integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder.
The proposed models are validated using five benchmark data sets.
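For orientation, the core MBO iteration on a graph alternates a short heat diffusion under the graph Laplacian with hard thresholding, pinning the known labels at each step. The standalone binary sketch below is illustrative only; the paper combines such updates with learned transformer and autoencoder features.

```python
import numpy as np

def graph_mbo(W, y_known, mask, n_outer=20, n_inner=5, dt=0.1):
    """Semi-supervised binary labelling via the graph MBO scheme.
    W: symmetric affinity matrix (n x n); y_known in {0, 1}; mask is a
    boolean array marking the nodes whose labels are observed."""
    L = np.diag(W.sum(axis=1)) - W          # unnormalised graph Laplacian
    u = np.where(mask, y_known, 0.5)        # unlabelled nodes start undecided
    for _ in range(n_outer):
        for _ in range(n_inner):            # diffusion half-step (heat flow)
            u = u - dt * (L @ u)
            u[mask] = y_known[mask]         # keep observed labels fixed
        u = (u > 0.5).astype(float)         # thresholding half-step
        u[mask] = y_known[mask]
    return u
```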
arXiv Detail & Related papers (2022-11-12T22:45:32Z)
- Composite Feature Selection using Deep Ensembles [130.72015919510605]
We investigate the problem of discovering groups of predictive features without predefined grouping.
We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups.
We propose a new metric to measure similarity between discovered groups and the ground truth.
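The paper's metric is not spelled out in this summary; a natural baseline for comparing discovered feature groups with ground-truth groups is best-match Jaccard overlap, sketched below as a stand-in.

```python
def group_similarity(found, truth):
    """Mean, over ground-truth groups, of the best Jaccard overlap with any
    discovered group; 1.0 means every true group is recovered exactly."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    return sum(max(jaccard(t, f) for f in found) for t in truth) / len(truth)

# Usage with groups given as sets of feature indices.
found = [{0, 1}, {2, 3, 4}]
truth = [{0, 1}, {2, 3}]
print(group_similarity(found, truth))  # 0.5 * (1.0 + 2/3) ≈ 0.833
```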
arXiv Detail & Related papers (2022-11-01T17:49:40Z)
- Statistical Comparisons of Classifiers by Generalized Stochastic Dominance [0.0]
There is still no consensus on how to compare classifiers over multiple data sets with respect to several criteria.
In this paper, we add a fresh view to the vivid debate by adopting recent developments in decision theory.
We show that our framework ranks classifiers by a generalized concept of dominance, which powerfully circumvents the cumbersome, and often even self-contradictory, reliance on aggregates.
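For reference, the classical first-order notion that the paper generalizes: with a quality score S (larger is better), classifier A stochastically dominates classifier B when A's score distribution is at least as favourable at every threshold.

```latex
% First-order stochastic dominance of classifier A over B w.r.t. score S:
A \succeq B
\iff
F_{S_A}(t) \le F_{S_B}(t) \quad \text{for all } t \in \mathbb{R}
% i.e. P(S_A > t) >= P(S_B > t) at every threshold t; the paper's
% generalized concept extends this idea to rankings over several criteria.
```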
arXiv Detail & Related papers (2022-09-05T09:28:15Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how combining recent results on equivariant representation learning, instantiated on structured spaces, with classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control [67.52000805944924]
Learn then Test (LTT) is a framework for calibrating machine learning models.
Our main insight is to reframe the risk-control problem as multiple hypothesis testing.
We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
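The reframing can be made concrete: for each candidate parameter lambda, test the null hypothesis "risk(lambda) exceeds the target level alpha" with a concentration-based p-value, and keep only the lambdas that survive a family-wise correction. Below is a minimal sketch using Hoeffding's inequality with Bonferroni correction (LTT admits other valid p-values and corrections):

```python
import numpy as np

def ltt_select(losses, alpha=0.1, delta=0.05):
    """losses: array (n_lambdas, n_cal) of [0, 1]-bounded losses on a held-out
    calibration set. Returns indices of lambdas certified to have risk <= alpha
    with probability >= 1 - delta, via Hoeffding p-values + Bonferroni."""
    n_lam, n = losses.shape
    r_hat = losses.mean(axis=1)                       # empirical risks
    # Hoeffding p-value for H0: risk(lambda) > alpha (valid for bounded losses).
    p = np.exp(-2.0 * n * np.clip(alpha - r_hat, 0.0, None) ** 2)
    return np.flatnonzero(p <= delta / n_lam)         # Bonferroni threshold

# Usage: certify thresholds whose calibration risk stays below 10%.
rng = np.random.default_rng(0)
losses = (rng.random((50, 500)) < np.linspace(0.02, 0.5, 50)[:, None]).astype(float)
valid = ltt_select(losses, alpha=0.1, delta=0.05)
```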
arXiv Detail & Related papers (2021-10-03T17:42:03Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)