Metrics for Benchmarking and Uncertainty Quantification: Quality,
Applicability, and a Path to Best Practices for Machine Learning in Chemistry
- URL: http://arxiv.org/abs/2010.00110v2
- Date: Fri, 22 Jan 2021 22:24:00 GMT
- Title: Metrics for Benchmarking and Uncertainty Quantification: Quality,
Applicability, and a Path to Best Practices for Machine Learning in Chemistry
- Authors: Gaurav Vishwakarma, Aditya Sonpal, Johannes Hachmann
- Abstract summary: This review aims to draw attention to two issues of concern when we set out to make machine learning benchmarking work in the chemical and materials domain.
They are often overlooked or underappreciated topics as chemists typically only have limited training in statistics.
These metrics are also key to comparing the performance of different models and thus for developing guidelines and best practices for the successful application of machine learning in chemistry.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This review aims to draw attention to two issues of concern when we set out
to make machine learning work in the chemical and materials domain, i.e.,
statistical loss function metrics for the validation and benchmarking of
data-derived models, and the uncertainty quantification of predictions made by
them. They are often overlooked or underappreciated topics as chemists
typically only have limited training in statistics. Aside from helping to
assess the quality, reliability, and applicability of a given model, these
metrics are also key to comparing the performance of different models and thus
for developing guidelines and best practices for the successful application of
machine learning in chemistry.
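To make the review's two topics concrete, here is a minimal, self-contained sketch (not code from the paper) that computes several standard regression error metrics together with a simple ensemble-spread uncertainty estimate; the data, the 5-member ensemble, and the use of NumPy are assumptions of this example.

```python
# Minimal sketch, not from the paper: standard regression metrics plus a
# crude ensemble-based uncertainty estimate, evaluated on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" property values and predictions from a 5-member ensemble.
y_true = rng.normal(size=200)
ensemble = y_true + rng.normal(scale=0.3, size=(5, 200))

y_pred = ensemble.mean(axis=0)            # ensemble-mean prediction
y_unc = ensemble.std(axis=0, ddof=1)      # per-sample spread as a crude uncertainty

res = y_true - y_pred
mae = np.mean(np.abs(res))                # mean absolute error
rmse = np.sqrt(np.mean(res ** 2))         # root-mean-square error
max_ae = np.max(np.abs(res))              # worst-case error
r2 = 1.0 - np.sum(res ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  MaxAE={max_ae:.3f}  R2={r2:.3f}")
print(f"mean predictive spread = {y_unc.mean():.3f}")
```

Reporting several complementary error metrics alongside a per-prediction uncertainty estimate conveys both average quality and worst-case behavior, which is the spirit of the best practices the review argues for.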
Related papers
- Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks [0.0]
We show how statistical methodology can be used for quantifying uncertainty in metrics that have been aggregated across multiple tasks.
These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite an overall poor performance.
arXiv Detail & Related papers (2025-01-08T02:17:34Z)
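As a hedged illustration of the kind of statistical methodology referenced in the entry above (a plain nonparametric bootstrap, chosen here for simplicity rather than taken from that paper), one can attach a confidence interval to a metric averaged over tasks; the per-task scores below are invented.

```python
# Hypothetical example (not from the paper): a nonparametric bootstrap
# confidence interval for a metric averaged over several tasks.
import numpy as np

rng = np.random.default_rng(42)
task_scores = np.array([0.81, 0.74, 0.92, 0.65, 0.88, 0.79, 0.70, 0.95])  # made-up scores

# Resample tasks with replacement and recompute the aggregate each time.
boot = rng.choice(task_scores, size=(10_000, task_scores.size), replace=True).mean(axis=1)
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"aggregate = {task_scores.mean():.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```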
- Developing a Dataset-Adaptive, Normalized Metric for Machine Learning Model Assessment: Integrating Size, Complexity, and Class Imbalance [0.0]
Traditional metrics like accuracy, F1-score, and precision are frequently used to evaluate machine learning models.
A dataset-adaptive, normalized metric that incorporates dataset characteristics like size, feature dimensionality, class imbalance, and signal-to-noise ratio is presented.
arXiv Detail & Related papers (2024-12-10T07:10:00Z)
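The dataset-adaptive metric of the entry above is not reproduced here; as a loose illustration of normalizing a score against one dataset characteristic, the sketch below adjusts raw accuracy for class imbalance only, rescaling it against a majority-class baseline (a Cohen's-kappa-style correction).

```python
# Illustration only (not the paper's metric): normalize accuracy against
# the accuracy a trivial majority-class predictor would achieve.
import numpy as np

def imbalance_normalized_accuracy(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = np.mean(y_true == y_pred)
    # Baseline: always predict the most frequent class in y_true.
    _, counts = np.unique(y_true, return_counts=True)
    baseline = counts.max() / counts.sum()
    # 1.0 = perfect, 0.0 = no better than the majority baseline, <0 = worse.
    return (acc - baseline) / (1.0 - baseline)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # imbalanced toy labels
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(f"normalized accuracy = {imbalance_normalized_accuracy(y_true, y_pred):.2f}")
```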
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z) - A Novel Metric for Measuring Data Quality in Classification Applications
(extended version) [0.0]
We introduce and explain a novel metric to measure data quality.
This metric is based on the correlated evolution between the classification performance and the deterioration of data.
We provide an interpretation of each criterion and examples of assessment levels.
arXiv Detail & Related papers (2023-12-13T11:20:09Z)
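The criteria defined in the paper above are not reproduced here; the sketch below only illustrates the underlying intuition of watching classification performance deteriorate as the data are progressively corrupted, using a toy 1-nearest-neighbour classifier on synthetic clusters.

```python
# Intuition only (not the paper's metric): flip an increasing fraction of
# training labels and record how test accuracy of a 1-NN classifier decays.
import numpy as np

rng = np.random.default_rng(0)

def make_blobs(n):
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, 2)) + 3.0 * y[:, None]   # two separated clusters
    return x, y

def one_nn_accuracy(x_tr, y_tr, x_te, y_te):
    d = np.linalg.norm(x_te[:, None, :] - x_tr[None, :, :], axis=2)
    return np.mean(y_tr[d.argmin(axis=1)] == y_te)

x_tr, y_tr = make_blobs(400)
x_te, y_te = make_blobs(200)

for frac in (0.0, 0.1, 0.2, 0.4):
    y_noisy = y_tr.copy()
    flip = rng.random(y_tr.size) < frac              # corrupt a fraction of labels
    y_noisy[flip] = 1 - y_noisy[flip]
    acc = one_nn_accuracy(x_tr, y_noisy, x_te, y_te)
    print(f"label noise {frac:.1f} -> test accuracy {acc:.3f}")
```

The shape of such a degradation curve is the kind of signal a data-quality metric of this flavor can summarize.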
- On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z)
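For readers unfamiliar with calibration measures, here is a generic sketch of the widely used expected calibration error (ECE) on synthetic predictions; it is not the evaluation pipeline of the paper above, and the overconfident toy model is an assumption of the example.

```python
# Generic expected calibration error (ECE) sketch on synthetic predictions.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

rng = np.random.default_rng(1)
conf = rng.uniform(0.5, 1.0, size=1000)       # model's stated confidence
correct = rng.random(1000) < conf ** 2        # synthetic, overconfident model
print(f"ECE = {expected_calibration_error(conf, correct):.3f}")
```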
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the absolute performance of the Llama 2 model, for example, by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice [0.0]
This user guide revisits and clarifies statistical techniques to assess the calibration or adequacy of a model.
It focuses mainly on the pedagogical presentation of existing results and of best practice.
Results are accompanied and illustrated by two real data case studies on workers' compensation and customer churn.
arXiv Detail & Related papers (2022-02-25T15:52:19Z)
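Independent of the guide above, a small numerical check illustrates why the scoring function must match the target functional: under squared error the best constant forecast of a skewed outcome is its mean, while under absolute error it is its median.

```python
# Illustration (independent of the paper): squared error is a consistent
# scoring function for the mean, absolute error for the median.
import numpy as np

rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)    # skewed outcomes

candidates = np.linspace(0.1, 5.0, 500)                # constant forecasts to compare
mse = [np.mean((y - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(y - c)) for c in candidates]

best_mse = candidates[int(np.argmin(mse))]
best_mae = candidates[int(np.argmin(mae))]
print(f"sample mean   = {y.mean():.3f}, forecast minimizing squared error  = {best_mse:.3f}")
print(f"sample median = {np.median(y):.3f}, forecast minimizing absolute error = {best_mae:.3f}")
```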
- Post-hoc Models for Performance Estimation of Machine Learning Inference [22.977047604404884]
Estimating how well a machine learning model performs during inference is critical in a variety of scenarios.
We systematically generalize performance estimation to a diverse set of metrics and scenarios.
We find that proposed post-hoc models consistently outperform the standard confidence baselines.
arXiv Detail & Related papers (2021-10-06T02:20:37Z)
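The post-hoc models of the entry above are not reproduced here; the sketch below builds the simplest possible stand-in, a binned confidence-to-accuracy mapping fitted on a held-out labeled split and applied to unlabeled data, and compares it with the raw confidence baseline on synthetic scores.

```python
# Hedged sketch (not the paper's models): a simple post-hoc performance
# estimator mapping model confidence to observed accuracy.
import numpy as np

rng = np.random.default_rng(7)

def synth(n):
    conf = rng.uniform(0.5, 1.0, size=n)     # model's confidence scores
    correct = rng.random(n) < conf ** 1.5    # model is somewhat overconfident
    return conf, correct

conf_val, correct_val = synth(2000)          # held-out labeled split
conf_new, correct_new = synth(5000)          # "unlabeled" deployment data

bins = np.linspace(0.5, 1.0, 11)
idx_val = np.clip(np.digitize(conf_val, bins) - 1, 0, 9)
idx_new = np.clip(np.digitize(conf_new, bins) - 1, 0, 9)

# Per-bin accuracy learned post hoc on the labeled split
# (every bin is populated here; guard against empty bins in real use).
bin_acc = np.array([correct_val[idx_val == b].mean() for b in range(10)])

estimated_acc = bin_acc[idx_new].mean()      # post-hoc estimate
naive_acc = conf_new.mean()                  # standard confidence baseline
true_acc = correct_new.mean()                # ground truth (unknown in practice)
print(f"true={true_acc:.3f}  post-hoc={estimated_acc:.3f}  confidence baseline={naive_acc:.3f}")
```

On this synthetic, overconfident model the learned mapping recovers the true accuracy much more closely than the raw confidence average.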
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.