Quality issues in Machine Learning Software Systems
- URL: http://arxiv.org/abs/2208.08982v2
- Date: Mon, 22 Aug 2022 17:43:10 GMT
- Title: Quality issues in Machine Learning Software Systems
- Authors: Pierre-Olivier Côté, Amin Nikanjam, Rached Bouchoucha, Foutse Khomh
- Abstract summary: This paper aims to investigate the characteristics of real quality issues in MLSSs from the viewpoint of practitioners.
We expect that the catalog of issues developed at this step will also help us later to identify the severity, root causes, and possible remedy for quality issues of MLSSs.
- Score: 12.655311590103238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context: An increasing demand is observed in various domains to employ
Machine Learning (ML) for solving complex problems. ML models are implemented
as software components and deployed in Machine Learning Software Systems
(MLSSs). Problem: There is a strong need to ensure the serving quality of
MLSSs. False or poor decisions of such systems can lead to the malfunction of
other systems, significant financial losses, or even threats to human life.
The quality assurance of MLSSs is considered a challenging task and is
currently a hot research topic. Moreover, it is important to cover all the
various aspects of quality in MLSSs. Objective: This paper aims to
investigate the characteristics of real quality issues in MLSSs from the
viewpoint of practitioners. This empirical study aims to identify a catalog
of bad practices related to poor quality in MLSSs. Method: We plan to conduct
a set of interviews with practitioners/experts, believing that interviews are
the best method to retrieve their experience and practices when dealing with
quality issues. We expect that the catalog of issues developed at this step
will also help us later to identify the severity, root causes, and possible
remedies for quality issues of MLSSs, allowing us to develop efficient
quality assurance tools for ML models and MLSSs.
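The abstract mentions quality assurance tools for ML models and MLSSs only as a downstream goal; as a purely illustrative sketch of one serving-quality check such tooling might automate, the Python snippet below flags drift in a model's output-score distribution. Everything here (the PredictionDriftCheck class, the Population Stability Index metric, the 0.2 threshold) is a hypothetical example, not something proposed in the paper.

```python
import numpy as np


class PredictionDriftCheck:
    """Flags a potential serving-quality issue when the distribution of a
    model's output scores drifts away from a reference window."""

    def __init__(self, reference_scores, n_bins=10, threshold=0.2):
        # Bin edges are fixed from the reference window so live and
        # reference histograms stay comparable.
        self.bins = np.histogram_bin_edges(reference_scores, bins=n_bins)
        ref_counts, _ = np.histogram(reference_scores, bins=self.bins)
        # Laplace smoothing avoids division by zero for empty bins.
        self.ref_dist = (ref_counts + 1) / (ref_counts.sum() + n_bins)
        self.threshold = threshold

    def psi(self, live_scores):
        """Population Stability Index between live and reference scores."""
        live_counts, _ = np.histogram(live_scores, bins=self.bins)
        live_dist = (live_counts + 1) / (live_counts.sum() + len(live_counts))
        return float(np.sum((live_dist - self.ref_dist)
                            * np.log(live_dist / self.ref_dist)))

    def check(self, live_scores):
        """Return (ok, psi); ok is False when drift exceeds the threshold."""
        value = self.psi(live_scores)
        return value < self.threshold, value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    check = PredictionDriftCheck(reference_scores=rng.beta(2, 5, size=5000))
    ok, psi = check.check(rng.beta(5, 2, size=1000))  # deliberately shifted scores
    print(f"serving quality ok: {ok}, PSI = {psi:.3f}")
```

A PSI above roughly 0.2 is a common rule of thumb for a significant shift, but neither the metric nor the threshold comes from this paper; they are stated only to make the example concrete.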
Related papers
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
- Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs, and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z)
- Maintainability Challenges in ML: A Systematic Literature Review [5.669063174637433]
This study aims to identify and synthesise the maintainability challenges in different stages of the Machine Learning workflow.
We screened more than 13000 papers, then selected and qualitatively analysed 56 of them.
arXiv Detail & Related papers (2024-08-17T13:24:15Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline in problems after September 2021, consistently across all difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z)
- Towards Self-Adaptive Machine Learning-Enabled Systems Through QoS-Aware Model Switching [1.2277343096128712]
We propose the concept of a Machine Learning Model Balancer, focusing on managing uncertainties related to ML models by using multiple models.
AdaMLS is a novel self-adaptation approach that leverages this concept and extends the traditional MAPE-K loop for continuous MLS adaptation.
Preliminary results suggest AdaMLS surpasses naive and single state-of-the-art models in QoS guarantees.
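As a rough illustration of the model-switching idea described above (not the actual AdaMLS implementation), the hypothetical Python sketch below runs a MAPE-K-style loop: it monitors request latency per model variant, analyzes it against a budget, plans a switch to the variant with the best observed latency, and executes requests with the active variant. All names and the latency budget are assumptions made for the example.

```python
import random
import time
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Shared knowledge: recent latency observations per model variant."""
    latencies: dict = field(default_factory=dict)  # variant name -> list of seconds


class ModelSwitcher:
    """Serves requests with one of several model variants and switches to a
    faster variant when the active one violates a latency budget."""

    def __init__(self, models, latency_budget_s=0.05):
        self.models = models              # variant name -> callable(x) -> prediction
        self.active = next(iter(models))  # start with the first variant
        self.budget = latency_budget_s
        self.knowledge = Knowledge(latencies={name: [] for name in models})

    def serve(self, x):
        """Monitor + Execute: serve a request and record the observed latency."""
        start = time.perf_counter()
        y = self.models[self.active](x)
        self.knowledge.latencies[self.active].append(time.perf_counter() - start)
        return y

    def adapt(self):
        """Analyze + Plan: if the active variant breaks the latency budget,
        switch to the variant with the lowest observed average latency
        (unobserved variants are tried optimistically)."""
        recent = self.knowledge.latencies[self.active][-20:]
        if recent and sum(recent) / len(recent) > self.budget:
            def avg_latency(name):
                obs = self.knowledge.latencies[name][-20:]
                return sum(obs) / len(obs) if obs else 0.0
            self.active = min(self.models, key=avg_latency)


if __name__ == "__main__":
    # Two stand-in "models": a slow accurate variant and a fast approximate one.
    models = {
        "large": lambda x: (time.sleep(0.08), round(x * 2, 3))[1],
        "small": lambda x: round(x * 2, 1),
    }
    switcher = ModelSwitcher(models, latency_budget_s=0.05)
    for _ in range(5):
        switcher.serve(random.random())
        switcher.adapt()
    print("active variant after adaptation:", switcher.active)
```

In practice the Analyze/Plan steps would weigh several QoS dimensions (for example accuracy estimates alongside latency) rather than latency alone; this sketch keeps a single dimension to stay short.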
arXiv Detail & Related papers (2023-08-19T09:33:51Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Quality Issues in Machine Learning Software Systems [10.103134260637402]
There is a strong need for ensuring the serving quality of Machine Learning Software Systems.
This paper aims to investigate the characteristics of real quality issues in MLSSs from the viewpoint of practitioners.
We identify 18 recurring quality issues and 21 strategies to mitigate them.
arXiv Detail & Related papers (2023-06-26T18:46:46Z)
- How Can Recommender Systems Benefit from Large Language Models: A Survey [82.06729592294322]
Large language models (LLMs) have shown impressive general intelligence and human-like capabilities.
We conduct a comprehensive survey on this research direction from the perspective of the whole pipeline in real-world recommender systems.
arXiv Detail & Related papers (2023-06-09T11:31:50Z)
- Quality Assurance Challenges for Machine Learning Software Applications During Software Development Life Cycle Phases [1.4213973379473654]
The paper conducts an in-depth review of literature on the quality assurance of Machine Learning (ML) models.
We develop a taxonomy of MLSA quality assurance issues by mapping the various ML adoption challenges across different phases of the software development life cycle (SDLC).
This mapping can help prioritize quality assurance efforts of MLSAs where the adoption of ML models can be considered crucial.
arXiv Detail & Related papers (2021-05-03T22:29:23Z)
- Towards Guidelines for Assessing Qualities of Machine Learning Systems [1.715032913622871]
This article presents the construction of a quality model for an ML system based on an industrial use case.
In the future, we want to learn how the term quality differs between different types of ML systems.
arXiv Detail & Related papers (2020-08-25T13:45:54Z)