Lessons from the trenches on evaluating machine-learning systems in materials science
- URL: http://arxiv.org/abs/2503.10837v1
- Date: Thu, 13 Mar 2025 19:40:58 GMT
- Title: Lessons from the trenches on evaluating machine-learning systems in materials science
- Authors: Nawaf Alampara, Mara Schilling-Wilhelmi, Kevin Maik Jablonka
- Abstract summary: We examine the current state and future directions of evaluation frameworks for machine learning in science. We identify challenges common across machine learning evaluation such as construct validity, data quality issues, metric design limitations, and benchmark maintenance problems. We propose evaluation cards as a structured approach to documenting measurement choices and limitations.
- Score: 0.3592274960837379
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Measurements are fundamental to knowledge creation in science, enabling consistent sharing of findings and serving as the foundation for scientific discovery. As machine learning systems increasingly transform scientific fields, the question of how to effectively evaluate these systems becomes crucial for ensuring reliable progress. In this review, we examine the current state and future directions of evaluation frameworks for machine learning in science. We organize the review around a broadly applicable framework for evaluating machine learning systems through the lens of statistical measurement theory, using materials science as our primary context for examples and case studies. We identify key challenges common across machine learning evaluation such as construct validity, data quality issues, metric design limitations, and benchmark maintenance problems that can lead to phantom progress when evaluation frameworks fail to capture real-world performance needs. By examining both traditional benchmarks and emerging evaluation approaches, we demonstrate how evaluation choices fundamentally shape not only our measurements but also research priorities and scientific progress. These findings reveal the critical need for transparency in evaluation design and reporting, leading us to propose evaluation cards as a structured approach to documenting measurement choices and limitations. Our work highlights the importance of developing a more diverse toolbox of evaluation techniques for machine learning in materials science, while offering insights that can inform evaluation practices in other scientific domains where similar challenges exist.
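As a concrete illustration of the evaluation cards proposed in the abstract, the sketch below shows one possible way to record measurement choices and limitations in a machine-readable form. The `EvaluationCard` class and its fields (construct, metrics, data provenance, known limitations, maintenance policy) are assumptions inferred from the abstract, not the authors' specification.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class EvaluationCard:
    """Illustrative record of measurement choices and limitations for an ML benchmark.

    Field names are inferred from the abstract (construct validity, data quality,
    metric design, benchmark maintenance); they are not the authors' specification.
    """
    benchmark_name: str
    construct: str                 # what the evaluation is intended to measure
    metrics: list[str]             # metric definitions and aggregation choices
    data_sources: list[str]        # provenance and quality notes for evaluation data
    known_limitations: list[str]   # threats to construct validity, coverage gaps
    maintenance_policy: str        # how and when the benchmark is revised

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Hypothetical card for a materials-property benchmark.
card = EvaluationCard(
    benchmark_name="formation-energy-benchmark (hypothetical)",
    construct="generalization of formation-energy predictions to unseen chemistries",
    metrics=["MAE in eV/atom", "rank correlation on a held-out composition split"],
    data_sources=["DFT-derived dataset with documented relaxation settings"],
    known_limitations=[
        "possible train/test leakage via near-duplicate structures",
        "aggregate MAE hides errors on rare chemistries",
    ],
    maintenance_policy="versioned releases; saturated splits are retired",
)
print(card.to_json())
```

Serializing the card as JSON is only one possible choice; the point of such a card is that the measurement choices become explicit and reviewable alongside reported scores.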
Related papers
- Toward an Evaluation Science for Generative AI Systems [22.733049816407114]
We advocate for maturing an evaluation science for generative AI systems. In particular, we present three key lessons: evaluation metrics must be applicable to real-world performance, metrics must be iteratively refined, and evaluation institutions and norms must be established.
arXiv Detail & Related papers (2025-03-07T11:23:48Z) - Evaluating Generative AI Systems is a Social Science Measurement Challenge [78.35388859345056]
We present a framework for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems.
The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves.
arXiv Detail & Related papers (2024-11-17T02:35:30Z) - Could Bibliometrics Reveal Top Science and Technology Achievements and Researchers? The Case for Evaluatology-based Science and Technology Evaluation [5.203905488272949]
We present an evaluatology-based science and technology evaluation methodology.
At the heart of this approach lies the concept of an extended evaluation condition, encompassing eight crucial components derived from a specific field.
Within a specific field like chip technology or open source, we construct a perfect evaluation model that can accurately trace the evolution and development of all achievements.
arXiv Detail & Related papers (2024-08-22T06:57:46Z) - Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition [70.60872754129832]
The first NeurIPS competition on unlearning sought to stimulate the development of novel algorithms.
Nearly 1,200 teams from across the world participated.
We analyze top solutions and delve into discussions on benchmarking unlearning.
arXiv Detail & Related papers (2024-06-13T12:58:00Z) - AI and Machine Learning for Next Generation Science Assessments [0.7416846035207727]
This chapter focuses on the transformative role of Artificial Intelligence (AI) and Machine Learning (ML) in science assessments.
The paper begins with a discussion of the Framework for K-12 Science Education, which calls for a shift from conceptual learning to knowledge-in-use.
The paper achieves three major goals: reviewing the current state of ML-based assessments in science education, introducing a framework for scoring accuracy in ML-based automatic assessments, and discussing future directions and challenges.
arXiv Detail & Related papers (2024-04-23T01:39:20Z) - Evaluatology: The Science and Engineering of Evaluation [11.997673313601423]
This article aims to formally introduce the discipline of evaluatology, which encompasses the science and engineering of evaluation.
We propose a universal framework for evaluation, encompassing concepts, terminologies, theories, and methodologies that can be applied across various disciplines.
arXiv Detail & Related papers (2024-03-19T13:38:26Z) - Image Quality Assessment in the Modern Age [53.19271326110551]
This tutorial provides the audience with the basic theories, methodologies, and current progress of image quality assessment (IQA).
We will first revisit several subjective quality assessment methodologies, with emphasis on how to properly select visual stimuli.
Both hand-engineered and (deep) learning-based methods will be covered.
arXiv Detail & Related papers (2021-10-19T02:38:46Z) - Physics-Informed Deep Learning: A Promising Technique for System Reliability Assessment [1.847740135967371]
There has been limited study of the use of deep learning for system reliability assessment.
We present an approach to frame system reliability assessment in the context of physics-informed deep learning.
The proposed approach is demonstrated by three numerical examples involving a dual-processor computing system.
arXiv Detail & Related papers (2021-08-24T16:24:46Z) - An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of widely used classical time-based methods and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z) - Through the Data Management Lens: Experimental Analysis and Evaluation of Fair Classification [75.49600684537117]
Data management research is showing an increasing presence and interest in topics related to data and algorithmic fairness.
We contribute a broad analysis of 13 fair classification approaches and additional variants, over their correctness, fairness, efficiency, scalability, and stability.
Our analysis highlights novel insights on the impact of different metrics and high-level approach characteristics on different aspects of performance.
arXiv Detail & Related papers (2021-01-18T22:55:40Z) - Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)