Quality Model for Machine Learning Components
- URL: http://arxiv.org/abs/2602.05043v1
- Date: Wed, 04 Feb 2026 20:50:51 GMT
- Title: Quality Model for Machine Learning Components
- Authors: Grace A. Lewis, Rachel Brower-Sinning, Robert Edman, Ipek Ozkaya, Sebastián Echeverría, Alex Derr, Collin Beaudoin, Katherine R. Maffey
- Abstract summary: Testing is still largely limited to testing model properties, such as model performance, without considering requirements derived from the system the model will be a part of. A newer standard, ISO 25059, defines a more specific quality model for AI systems. We present a quality model for ML components that serves as a guide for requirements elicitation and negotiation.
- Score: 3.654750616721868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite increased adoption and advances in machine learning (ML), there are studies showing that many ML prototypes do not reach the production stage and that testing is still largely limited to testing model properties, such as model performance, without considering requirements derived from the system the model will be a part of, such as throughput, resource consumption, or robustness. This limited view of testing leads to failures in model integration, deployment, and operations. In traditional software development, quality models such as ISO 25010 provide a widely used structured framework to assess software quality, define quality requirements, and provide a common language for communication with stakeholders. A newer standard, ISO 25059, defines a more specific quality model for AI systems. However, a problem with this standard is that it combines system attributes with ML component attributes, which is not helpful for a model developer, as many system attributes cannot be assessed at the component level. In this paper, we present a quality model for ML components that serves as a guide for requirements elicitation and negotiation and provides a common vocabulary for ML component developers and system stakeholders to agree on and define system-derived requirements and focus their testing efforts accordingly. The quality model was validated through a survey in which the participants agreed with its relevance and value. The quality model has been successfully integrated into an open-source tool for ML component testing and evaluation, demonstrating its practical application.
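To make the abstract's point concrete, the sketch below shows one way system-derived requirements such as throughput, resource consumption, and robustness could be phrased as component-level pass/fail checks, giving developers and system stakeholders a shared vocabulary for test results. This is a minimal illustration only: the function names, thresholds, and the predict() stub are hypothetical and are not taken from the paper's quality model or from the tool it mentions.

```python
# Hypothetical sketch: phrasing system-derived quality requirements
# (throughput, resource consumption, robustness) as component-level checks.
# All names and thresholds are illustrative, not from the paper or its tool.
import time
import tracemalloc


def predict(batch):
    """Stand-in for the ML component under test."""
    return [0 for _ in batch]


def meets_throughput(batch, min_items_per_sec=100.0):
    # Requirement: the component must sustain a minimum prediction rate.
    start = time.perf_counter()
    predict(batch)
    elapsed = time.perf_counter() - start
    return len(batch) / max(elapsed, 1e-9) >= min_items_per_sec


def meets_memory_budget(batch, max_peak_mb=512.0):
    # Requirement: peak Python heap allocation stays within a budget.
    tracemalloc.start()
    predict(batch)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak_bytes / 1e6 <= max_peak_mb


def meets_robustness(batch, perturb, max_flip_rate=0.05):
    # Requirement: predictions rarely change under a small input perturbation.
    clean = predict(batch)
    noisy = predict([perturb(x) for x in batch])
    flips = sum(c != n for c, n in zip(clean, noisy))
    return flips / len(batch) <= max_flip_rate


if __name__ == "__main__":
    batch = list(range(1000))
    report = {
        "throughput": meets_throughput(batch),
        "resource_consumption": meets_memory_budget(batch),
        "robustness": meets_robustness(batch, perturb=lambda x: x + 1),
    }
    print(report)  # each system-derived requirement reported as pass/fail
```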
Related papers
- Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs [60.0988889107102]
We explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs). We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. We develop an agentic system (Q-Mirror) which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement.
arXiv Detail & Related papers (2025-09-29T05:22:10Z)
- OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging [124.91183814854126]
Model merging seeks to combine multiple expert models into a single model. We introduce a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. We find that model merging offers a promising way for building improved MLLMs without requiring training data.
arXiv Detail & Related papers (2025-05-26T12:23:14Z)
- VideoGen-Eval: Agent-based System for Video Generation Evaluation [54.662739174367836]
Video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models. We propose VideoGen-Eval, an agent evaluation system that integrates content structuring, MLLM-based content judgment, and patch tools for temporal-dense dimensions. We introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system.
arXiv Detail & Related papers (2025-03-30T14:12:21Z)
- Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Using Quality Attribute Scenarios for ML Model Test Case Generation [3.9111051646728527]
Current practice for machine learning (ML) model testing prioritizes testing for model performance.
This paper presents an approach based on quality attribute (QA) scenarios to elicit and define system- and model-relevant test cases.
The QA-based approach has been integrated into MLTE, a process and tool to support ML model test and evaluation.
arXiv Detail & Related papers (2024-06-12T18:26:42Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- MLTEing Models: Negotiating, Evaluating, and Documenting Model and System Qualities [1.1352560842946413]
MLTE is a framework and implementation to evaluate machine learning models and systems.
It compiles state-of-the-art evaluation techniques into an organizational process.
MLTE tooling supports this process by providing a domain-specific language that teams can use to express model requirements.
arXiv Detail & Related papers (2023-03-03T15:10:38Z)
- Mutation Testing framework for Machine Learning [0.0]
Failure of Machine Learning Models can lead to severe consequences in terms of loss of life or property.
Developers, scientists, and the ML community around the world must build a highly reliable test architecture for critical ML applications.
This article provides an overview of Machine Learning Systems (MLS) testing, covering its evolution, current paradigm, and future work.
arXiv Detail & Related papers (2021-02-19T18:02:31Z)
- Towards Guidelines for Assessing Qualities of Machine Learning Systems [1.715032913622871]
This article presents the construction of a quality model for an ML system based on an industrial use case.
In the future, we want to learn how the term quality differs between different types of ML systems.
arXiv Detail & Related papers (2020-08-25T13:45:54Z)
- Quantitatively Assessing the Benefits of Model-driven Development in Agent-based Modeling and Simulation [80.49040344355431]
This paper compares the use of MDD and ABMS platforms in terms of effort and developer mistakes.
The obtained results show that MDD4ABMS requires less effort to develop simulations with similar (sometimes better) design quality than NetLogo.
arXiv Detail & Related papers (2020-06-15T23:29:04Z)
- MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale [32.62513495487506]
Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them.
The complicated procedures for evaluating innovations, along with the lack of standard and efficient ways of specifying and provisioning ML/DL evaluation, is a major "pain point" for the community.
This paper proposes MLModelScope, an open-source, framework- and hardware-agnostic, and customizable design that enables repeatable, fair, and scalable model evaluation and benchmarking.
arXiv Detail & Related papers (2020-02-19T17:13:01Z)