Metrics for Benchmarking and Uncertainty Quantification: Quality,
Applicability, and a Path to Best Practices for Machine Learning in Chemistry
- URL: http://arxiv.org/abs/2010.00110v2
- Date: Fri, 22 Jan 2021 22:24:00 GMT
- Title: Metrics for Benchmarking and Uncertainty Quantification: Quality,
Applicability, and a Path to Best Practices for Machine Learning in Chemistry
- Authors: Gaurav Vishwakarma, Aditya Sonpal, Johannes Hachmann
- Abstract summary: This review aims to draw attention to two issues of concern when we set out to make machine learning benchmarking work in the chemical and materials domain.
They are often overlooked or underappreciated topics as chemists typically only have limited training in statistics.
These metrics are also key to comparing the performance of different models and thus for developing guidelines and best practices for the successful application of machine learning in chemistry.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This review aims to draw attention to two issues of concern when we set out
to make machine learning work in the chemical and materials domain, i.e.,
statistical loss function metrics for the validation and benchmarking of
data-derived models, and the uncertainty quantification of predictions made by
them. They are often overlooked or underappreciated topics as chemists
typically only have limited training in statistics. Aside from helping to
assess the quality, reliability, and applicability of a given model, these
metrics are also key to comparing the performance of different models and thus
for developing guidelines and best practices for the successful application of
machine learning in chemistry.
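As an illustrative aside (not drawn from the review itself), the following sketch computes several common regression benchmarking metrics and a crude ensemble-spread uncertainty estimate on synthetic data standing in for a molecular-property model; the use of scikit-learn, a random forest, and this particular metric set is an assumption for illustration only.

```python
# Illustrative sketch only: models, data, and metric choices are placeholders,
# not taken from the review.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a molecular-property regression dataset.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Benchmarking metrics: each emphasises a different aspect of the error.
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
max_abs_err = np.max(np.abs(y_test - y_pred))
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}  MaxAE={max_abs_err:.2f}")

# Crude uncertainty estimate: spread of the ensemble members' predictions.
per_tree = np.stack([tree.predict(X_test) for tree in model.estimators_])
print(f"mean predictive std = {per_tree.std(axis=0).mean():.2f}")
```

Reporting several metrics side by side matters because each weights errors differently; RMSE penalises outliers more heavily than MAE, for instance.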
Related papers
- Causal Inference Tools for a Better Evaluation of Machine Learning [0.0]
We introduce key statistical methods such as Ordinary Least Squares (OLS) regression, Analysis of Variance (ANOVA) and logistic regression.
The document serves as a guide for researchers and practitioners, detailing how these techniques can provide deeper insights into model behavior, performance, and fairness.
arXiv Detail & Related papers (2024-10-02T10:03:29Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- A Novel Metric for Measuring Data Quality in Classification Applications (extended version) [0.0]
We introduce and explain a novel metric to measure data quality.
This metric is based on the correlated evolution between the classification performance and the deterioration of data.
We provide an interpretation of each criterion and examples of assessment levels.
arXiv Detail & Related papers (2023-12-13T11:20:09Z)
- On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Evaluating Explainability in Machine Learning Predictions through Explainer-Agnostic Metrics [0.0]
We develop six distinct model-agnostic metrics designed to quantify the extent to which model predictions can be explained.
These metrics measure different aspects of model explainability, ranging from local importance and global importance to surrogate predictions.
We demonstrate the practical utility of these metrics on classification and regression tasks, and integrate these metrics into an existing Python package for public use.
arXiv Detail & Related papers (2023-02-23T15:28:36Z)
- Evaluating Machine Unlearning via Epistemic Uncertainty [78.27542864367821]
This work presents an evaluation of Machine Unlearning algorithms based on uncertainty.
To the best of our knowledge, this is the first definition of a general evaluation of this kind.
arXiv Detail & Related papers (2022-08-23T09:37:31Z)
- Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice [0.0]
This user guide revisits and clarifies statistical techniques to assess the calibration or adequacy of a model.
It focuses mainly on the pedagogical presentation of existing results and of best practice.
Results are accompanied and illustrated by two real data case studies on workers' compensation and customer churn.
arXiv Detail & Related papers (2022-02-25T15:52:19Z)
- Post-hoc Models for Performance Estimation of Machine Learning Inference [22.977047604404884]
Estimating how well a machine learning model performs during inference is critical in a variety of scenarios.
We systematically generalize performance estimation to a diverse set of metrics and scenarios.
We find that proposed post-hoc models consistently outperform the standard confidence baselines.
arXiv Detail & Related papers (2021-10-06T02:20:37Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
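The sketches below are illustrative only and are not taken from the cited papers; each names the entry above that it refers to. First, for "Causal Inference Tools for a Better Evaluation of Machine Learning", a minimal sketch of how OLS, ANOVA, and logistic regression might be turned on an ML model's errors; the specific analyses (regressing absolute errors on input features, comparing errors across data sources) are assumptions.

```python
# Hedged illustration: the exact analyses in the cited paper may differ.
import numpy as np
import statsmodels.api as sm
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n = 300
features = rng.normal(size=(n, 3))                                      # model inputs
abs_err = np.abs(0.5 * features[:, 0] + rng.normal(scale=0.3, size=n))  # synthetic errors
good_pred = (abs_err < 0.4).astype(int)                                 # "prediction was acceptable"
source = rng.integers(0, 3, size=n)                                     # e.g. a data-source label

# OLS: which inputs are associated with larger errors?
print(sm.OLS(abs_err, sm.add_constant(features)).fit().params)

# One-way ANOVA: do errors differ across data sources?
print(f_oneway(*(abs_err[source == g] for g in range(3))))

# Logistic regression: probability of an acceptable prediction given the inputs.
print(sm.Logit(good_pred, sm.add_constant(features)).fit(disp=0).params)
```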
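For "A Novel Metric for Measuring Data Quality in Classification Applications", a hedged sketch of the idea as summarised above: deteriorate the data step by step and track how classification performance co-evolves. The label-flipping scheme and the scalar summary are stand-ins, not the paper's exact metric.

```python
# Hedged sketch: track how accuracy degrades as training labels are corrupted;
# the label-flipping scheme and the scalar summary are stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

noise_levels = np.linspace(0.0, 0.5, 6)
accuracies = []
for p in noise_levels:
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < p          # flip roughly a fraction p of labels
    y_noisy[flip] = 1 - y_noisy[flip]
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)
    accuracies.append(clf.score(X_te, y_te))

# One possible scalar summary of the performance-vs-deterioration curve.
print(np.round(accuracies, 3), "summary:", round(float(np.mean(accuracies)), 3))
```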
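For "On the Calibration of Large Language Models and Alignment", a small sketch of expected calibration error (ECE), a standard way to summarise confidence calibration; the bin count and toy data are arbitrary, and the paper's own evaluation protocol may differ.

```python
# Hedged sketch of expected calibration error (ECE); bin count is arbitrary.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap           # weight by the bin's share of samples
    return ece

# Toy example: an overconfident model.
conf = np.array([0.9, 0.95, 0.8, 0.85, 0.7, 0.99])
hit = np.array([1, 0, 1, 0, 1, 1])
print(round(expected_calibration_error(conf, hit), 3))
```

A well-calibrated model's confidence matches its observed accuracy in every bin, so its ECE is near zero.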
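For "Evaluating Explainability in Machine Learning Predictions through Explainer-Agnostic Metrics", a sketch of two ingredients that explainer-agnostic metrics are commonly built from, permutation (global) importance and surrogate fidelity; these are stand-ins, not the six metrics defined in the paper.

```python
# Hedged sketch: permutation importance and surrogate fidelity as illustrative
# ingredients; these are not the six metrics defined in the cited paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
black_box = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Global importance via feature permutation on held-out data.
imp = permutation_importance(black_box, X_te, y_te, n_repeats=10, random_state=0)
print(np.round(imp.importances_mean, 3))

# Surrogate fidelity: how well a shallow tree imitates the black box's outputs.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_te, black_box.predict(X_te))
print(round((surrogate.predict(X_te) == black_box.predict(X_te)).mean(), 3))
```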
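For "Model Comparison and Calibration Assessment", a sketch of comparing two probabilistic classifiers with consistent (proper) scoring rules plus a simple calibration-in-the-large check; the guide itself covers far more, and the models and data here are placeholders.

```python
# Hedged sketch: comparing probabilistic classifiers with proper scoring rules
# plus a calibration-in-the-large check; models and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "forest": RandomForestClassifier(random_state=0).fit(X_tr, y_tr),
}
for name, m in models.items():
    p = m.predict_proba(X_te)[:, 1]
    # Proper (consistent) scoring rules reward both calibration and sharpness;
    # the last number is mean forecast minus observed event rate.
    print(name, round(log_loss(y_te, p), 3), round(brier_score_loss(y_te, p), 3),
          round(float(p.mean() - y_te.mean()), 3))
```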
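For "Post-hoc Models for Performance Estimation of Machine Learning Inference", a sketch of one plausible reading: train a small post-hoc model on confidence-derived features from a held-out split to predict whether the primary model is correct, then compare it with the raw-confidence baseline. The feature choices and models are assumptions.

```python
# Hedged sketch: a post-hoc model predicts whether the primary model is correct
# from confidence-derived features; the cited paper covers more metrics and scenarios.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

primary = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

def confidence_features(model, X):
    proba = model.predict_proba(X)
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    return np.column_stack([proba.max(axis=1), entropy])

# Fit the post-hoc estimator on a held-out validation split.
val_correct = (primary.predict(X_val) == y_val).astype(int)
post_hoc = LogisticRegression(max_iter=1000).fit(confidence_features(primary, X_val), val_correct)

# Compare correctness prediction on the test split: raw confidence vs. post-hoc model.
te_correct = (primary.predict(X_te) == y_te).astype(int)
baseline = primary.predict_proba(X_te).max(axis=1)
post_hoc_score = post_hoc.predict_proba(confidence_features(primary, X_te))[:, 1]
print(round(roc_auc_score(te_correct, baseline), 3),
      round(roc_auc_score(te_correct, post_hoc_score), 3))
```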
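For "ALT-MAS", a heavily hedged sketch: the paper estimates test metrics with a Bayesian neural network and active label selection, whereas this stand-in uses an ensemble as a crude surrogate for the BNN posterior and skips the active selection entirely.

```python
# Heavily hedged sketch: the paper uses a Bayesian neural network and active
# label selection; here an ensemble stands in for the BNN posterior and no
# active selection is performed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model_under_test = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
surrogate = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Estimated accuracy: average probability, under the surrogate's predictive
# distribution, that the model-under-test's prediction equals the true label.
# (Class labels here are 0/1, matching the predict_proba column order.)
pred = model_under_test.predict(X_te)
p_pred_is_true = surrogate.predict_proba(X_te)[np.arange(len(X_te)), pred]
print("estimated:", round(float(p_pred_is_true.mean()), 3),
      "actual:", round(float((pred == y_te).mean()), 3))
```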
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.