AllMetrics: A Unified Python Library for Standardized Metric Evaluation and Robust Data Validation in Machine Learning
- URL: http://arxiv.org/abs/2505.15931v1
- Date: Wed, 21 May 2025 18:36:05 GMT
- Title: AllMetrics: A Unified Python Library for Standardized Metric Evaluation and Robust Data Validation in Machine Learning
- Authors: Morteza Alizadeh, Mehrdad Oveisi, Sonya Falahati, Ghazal Mousavi, Mohsen Alambardar Meybodi, Somayeh Sadat Mehrnia, Ilker Hacihaliloglu, Arman Rahmim, Mohammad R. Salmanpour
- Abstract summary: We introduce AllMetrics, an open-source unified Python library designed to standardize metric evaluation across diverse machine learning tasks. The library implements class-specific reporting for multi-class tasks through configurable parameters to cover all use cases. Various datasets from domains such as healthcare, finance, and real estate were applied to the library, and the results were compared against Python, MATLAB, and R implementations.
- Score: 2.325084918639609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) models rely heavily on consistent and accurate performance metrics to evaluate and compare their effectiveness. However, existing libraries often suffer from fragmentation, inconsistent implementations, and insufficient data validation protocols, leading to unreliable results. Existing libraries have often been developed independently and without adherence to a unified standard, particularly concerning the specific tasks they aim to support. As a result, each library tends to adopt its own conventions for metric computation, input/output formatting, error handling, and data validation protocols. This lack of standardization leads to both implementation differences (ID) and reporting differences (RD), making it difficult to compare results across frameworks or ensure reliable evaluations. To address these issues, we introduce AllMetrics, an open-source unified Python library designed to standardize metric evaluation across diverse ML tasks, including regression, classification, clustering, segmentation, and image-to-image translation. The library implements class-specific reporting for multi-class tasks through configurable parameters to cover all use cases, while incorporating task-specific parameters to resolve metric computation discrepancies across implementations. Various datasets from domains such as healthcare, finance, and real estate were applied to our library, and the results were compared against corresponding Python, MATLAB, and R implementations to identify which yield consistent results. AllMetrics combines a modular Application Programming Interface (API) with robust input validation mechanisms to ensure reproducibility and reliability in model evaluation. This paper presents the design principles, architectural components, and empirical analyses demonstrating the ability to mitigate evaluation errors and to enhance the trustworthiness of ML workflows.
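To make the reporting-difference (RD) problem described in the abstract concrete, the sketch below shows how a single set of multi-class predictions yields several different "headline" F1 scores depending on the averaging convention in use. It is an illustration only, written against scikit-learn rather than the AllMetrics API described in the paper, and the example labels are invented.

```python
# A minimal sketch of the "reporting differences" (RD) problem,
# using scikit-learn (not the AllMetrics API); the labels are made up.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 2])

# Class-specific reporting: one F1 value per class.
print("per-class F1:", f1_score(y_true, y_pred, average=None))

# Aggregated reporting: the same predictions produce different single-number
# summaries depending on which averaging convention a library (or its
# defaults) applies.
for avg in ("macro", "micro", "weighted"):
    print(f"{avg:>8} F1: {f1_score(y_true, y_pred, average=avg):.3f}")
```

A unified library in the spirit of AllMetrics would expose both the per-class vector and each aggregate under one validated, consistently documented interface, so that two workflows reporting "F1" can be compared without first reverse-engineering which convention each one used.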
Related papers
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- A Distance Metric for Mixed Integer Programming Instances [0.0]
Mixed-integer linear programming (MILP) is a powerful tool for addressing a wide range of real-world problems. Existing similarity metrics often lack precision in identifying instance classes or rely heavily on labeled data. This paper introduces the first mathematical distance metric for MILP instances, derived directly from their mathematical formulations.
arXiv Detail & Related papers (2025-07-15T07:55:09Z)
- OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics [101.78963920333342]
We introduce OpenUnlearning, a standardized framework for benchmarking large language model (LLM) unlearning methods and metrics. OpenUnlearning integrates 9 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite.
arXiv Detail & Related papers (2025-06-14T20:16:37Z)
- Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks [0.0]
We show how statistical methodology can be used for quantifying uncertainty in metrics that have been aggregated across multiple tasks. These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite an overall poor performance.
arXiv Detail & Related papers (2025-01-08T02:17:34Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
- Are Large Language Models Memorizing Bug Benchmarks? [6.640077652362016]
Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. A growing concern within the software engineering community is that benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. We systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks.
arXiv Detail & Related papers (2024-11-20T13:46:04Z)
- Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings [77.20838441870151]
We use an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments. We collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts. Our results indicate that edit distance exhibits the highest correlation with the online metric, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation.
arXiv Detail & Related papers (2024-10-15T20:32:07Z)
- Threshold-Consistent Margin Loss for Open-World Deep Metric Learning [42.03620337000911]
Existing losses used in deep metric learning (DML) for image retrieval often lead to highly non-uniform intra-class and inter-class representation structures.
This inconsistency often complicates the threshold selection process when deploying commercial image retrieval systems.
We propose a novel variance-based metric called Operating-Point-Inconsistency-Score (OPIS) that quantifies the variance in the operating characteristics across classes.
arXiv Detail & Related papers (2023-07-08T21:16:41Z)
- Variable Importance Matching for Causal Inference [73.25504313552516]
We describe a general framework called Model-to-Match that achieves these goals.
Model-to-Match uses variable importance measurements to construct a distance metric.
We operationalize the Model-to-Match framework with LASSO.
arXiv Detail & Related papers (2023-02-23T00:43:03Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
- SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics [74.28810048824519]
SacreROUGE is an open-source library for using and developing summarization evaluation metrics.
The library provides Python wrappers around the official implementations of existing evaluation metrics.
It provides functionality to evaluate how well any metric implemented in the library correlates to human-annotated judgments.
arXiv Detail & Related papers (2020-07-10T13:26:37Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)