Quality Issues in Machine Learning Software Systems
- URL: http://arxiv.org/abs/2306.15007v2
- Date: Sat, 3 Aug 2024 14:52:22 GMT
- Title: Quality Issues in Machine Learning Software Systems
- Authors: Pierre-Olivier Côté, Amin Nikanjam, Rached Bouchoucha, Ilan Basta, Mouna Abidi, Foutse Khomh
- Abstract summary: There is a strong need for ensuring the serving quality of Machine Learning Software Systems (MLSSs).
This paper aims to investigate the characteristics of real quality issues in MLSSs from the viewpoint of practitioners.
We identify 18 recurring quality issues and 21 strategies to mitigate them.
- Score: 10.103134260637402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context: An increasing demand is observed in various domains to employ Machine Learning (ML) for solving complex problems. ML models are implemented as software components and deployed in Machine Learning Software Systems (MLSSs). Problem: There is a strong need for ensuring the serving quality of MLSSs. False or poor decisions of such systems can lead to malfunction of other systems, significant financial losses, or even threats to human life. The quality assurance of MLSSs is considered a challenging task and currently is a hot research topic. Objective: This paper aims to investigate the characteristics of real quality issues in MLSSs from the viewpoint of practitioners. This empirical study aims to identify a catalog of quality issues in MLSSs. Method: We conduct a set of interviews with practitioners/experts, to gather insights about their experience and practices when dealing with quality issues. We validate the identified quality issues via a survey with ML practitioners. Results: Based on the content of 37 interviews, we identified 18 recurring quality issues and 21 strategies to mitigate them. For each identified issue, we describe the causes and consequences according to the practitioners' experience. Conclusion: We believe the catalog of issues developed in this study will allow the community to develop efficient quality assurance tools for ML models and MLSSs. A replication package of our study is available on our public GitHub repository.
Related papers
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z) - Demystifying Issues, Causes and Solutions in LLM Open-Source Projects [15.881912703104376]
We conducted an empirical study to understand the issues that practitioners encounter when developing and using LLM open-source software.
We collected all closed issues from 15 LLM open-source projects and labelled issues that met our requirements.
Our study results show that Model Issue is the most common issue faced by practitioners.
arXiv Detail & Related papers (2024-09-25T02:16:45Z) - SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z) - Maintainability Challenges in ML: A Systematic Literature Review [5.669063174637433]
This study aims to identify and synthesise the maintainability challenges in different stages of the Machine Learning workflow.
We screened more than 13000 papers, then selected and qualitatively analysed 56 of them.
arXiv Detail & Related papers (2024-08-17T13:24:15Z) - Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z) - Status Quo and Problems of Requirements Engineering for Machine Learning: Results from an International Survey [7.164324501049983]
Requirements Engineering (RE) can help address many problems when engineering Machine Learning-enabled systems.
We conducted a survey to gather practitioner insights into the status quo and problems of RE in ML-enabled systems.
We found significant differences in RE practices within ML projects.
arXiv Detail & Related papers (2023-10-10T15:53:50Z) - SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce an expansive benchmark suite, SciBench, for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z) - Quality issues in Machine Learning Software Systems [12.655311590103238]
This paper aims to investigate the characteristics of real quality issues in MLSSs from the viewpoint of practitioners.
We expect that the catalog of issues developed at this step will also help us later to identify the severity, root causes, and possible remedy for quality issues of MLSSs.
arXiv Detail & Related papers (2022-08-18T17:55:18Z) - Quality Assurance Challenges for Machine Learning Software Applications During Software Development Life Cycle Phases [1.4213973379473654]
The paper conducts an in-depth review of literature on the quality assurance of Machine Learning (ML) models.
We develop a taxonomy of MLSA quality assurance issues by mapping the various ML adoption challenges across different phases of the software development life cycle (SDLC).
This mapping can help prioritize quality assurance efforts of MLSAs where the adoption of ML models can be considered crucial.
arXiv Detail & Related papers (2021-05-03T22:29:23Z) - Understanding the Usability Challenges of Machine Learning In High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains.
In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions.
We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z)