A Systematic Survey on Debugging Techniques for Machine Learning Systems
- URL: http://arxiv.org/abs/2503.03158v1
- Date: Wed, 05 Mar 2025 03:57:20 GMT
- Title: A Systematic Survey on Debugging Techniques for Machine Learning Systems
- Authors: Thanh-Dat Nguyen, Haoye Tian, Bach Le, Patanamon Thongtanunam, Shane McIntosh,
- Abstract summary: Machine learning (ML) software poses unique challenges compared to traditional software. Various methods have been proposed for testing, diagnosing, and repairing ML systems. However, the big picture informing important research directions that fulfill developers' needs is yet to unfold.
- Score: 5.747738795689893
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Debugging ML software (i.e., the detection, localization, and fixing of faults) poses unique challenges compared to traditional software, largely due to the probabilistic nature and heterogeneity of its development process. Various methods have been proposed for testing, diagnosing, and repairing ML systems. However, the big picture informing important research directions that truly address the dire needs of developers is yet to unfold, leaving several key questions unaddressed: (1) What faults have been targeted in ML debugging research that fulfill developers' needs in practice? (2) How are these faults addressed? (3) What are the challenges in addressing the yet-untargeted faults? In this paper, we conduct a systematic study of debugging techniques for machine learning systems. We first collect technical papers focusing on debugging components in machine learning software. We then map these papers to a taxonomy of faults to assess the current state of fault resolution identified in existing literature. Subsequently, we analyze which techniques are used to address specific faults based on the collected papers. This results in a comprehensive taxonomy that aligns faults with their corresponding debugging methods. Finally, we examine previously released transcripts of interviews with developers to identify the challenges in resolving unfixed faults. Our analysis reveals that only 48% of the identified ML debugging challenges have been explicitly addressed by researchers, while 46.9% remain unresolved or unmentioned. In real-world applications, we found that 52.6% of issues reported on GitHub and 70.3% of problems discussed in interviews are still unaddressed by research in ML debugging. The study identifies 13 primary challenges in ML debugging, highlighting a significant gap between the identification of ML debugging issues and their resolution in practice.
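To make the reported coverage figures concrete: the study maps collected papers onto a taxonomy of faults/challenges and measures what fraction of challenges at least one paper addresses. A minimal sketch of such a coverage computation, with a hypothetical taxonomy and paper-to-challenge mapping (not the authors' actual data):

```python
from collections import defaultdict

# Hypothetical challenge taxonomy and paper-to-challenge mapping; the real
# study derives these from a systematic literature review and from
# previously released developer-interview transcripts.
challenges = ["data-shape-mismatch", "training-divergence", "api-misuse", "label-noise"]
papers = {
    "paper-001": ["data-shape-mismatch", "api-misuse"],
    "paper-002": ["training-divergence"],
}

addressed = defaultdict(list)
for paper_id, targeted in papers.items():
    for challenge in targeted:
        addressed[challenge].append(paper_id)

covered = [c for c in challenges if addressed[c]]
uncovered = [c for c in challenges if not addressed[c]]
coverage = 100.0 * len(covered) / len(challenges)
print(f"Challenges addressed by at least one paper: {coverage:.1f}%")
print("Still unaddressed:", uncovered)
```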
Related papers
- Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive.
We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons.
The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z) - Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution [22.03052751722933]
Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads.
We have identified the most prevalent errors -- such as ModuleNotFoundError and TypeError -- and highlighted particularly challenging errors like OSError and database-related issues.
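For this kind of analysis, execution errors from an agent's reproduction attempts can be bucketed by exception type. A rough sketch under assumed inputs (the reproduction scripts and commands here are hypothetical, not the paper's pipeline):

```python
import subprocess
from collections import Counter

def classify_stderr(stderr: str) -> str:
    """Very rough bucketing of Python execution errors by exception name."""
    for name in ("ModuleNotFoundError", "TypeError", "OSError"):
        if name in stderr:
            return name
    return "other"

# Hypothetical reproduction commands produced by a code agent.
commands = [["python", "repro_1.py"], ["python", "repro_2.py"]]

counts = Counter()
for cmd in commands:
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        counts[classify_stderr(result.stderr)] += 1

print(counts.most_common())
```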
arXiv Detail & Related papers (2025-03-16T06:24:51Z) - SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z) - Reproducibility of Issues Reported in Stack Overflow Questions: Challenges, Impact & Estimation [2.2160604288512324]
Software developers often submit questions to technical Q&A sites like Stack Overflow (SO) to resolve code-level problems.
In practice, they include example code snippets with questions to explain the programming issues.
Unfortunately, such code snippets cannot always reproduce the issues due to several unmet challenges.
arXiv Detail & Related papers (2024-07-13T22:55:35Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
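A minimal sketch of a training-free self-critique loop of the kind described above, assuming a generic chat-completion callable `llm_complete` and using `py_compile` as a stand-in for compiler feedback; the prompts and helpers are illustrative, not the paper's implementation:

```python
import py_compile
import tempfile

def compiler_feedback(code: str) -> str | None:
    """Return a compile/parse error message, or None if the code compiles."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        py_compile.compile(path, doraise=True)
        return None
    except py_compile.PyCompileError as e:
        return str(e)

def self_critique_repair(llm_complete, task: str, max_rounds: int = 3) -> str:
    """Iteratively ask the model to critique and fix its own code."""
    code = llm_complete(f"Write Python code for: {task}")
    for _ in range(max_rounds):
        error = compiler_feedback(code)
        if error is None:
            break
        critique = llm_complete(
            f"The following code fails with:\n{error}\n"
            f"Identify the bug type and explain the fix.\n\n{code}"
        )
        code = llm_complete(f"Rewrite the code applying this critique:\n{critique}\n\n{code}")
    return code
```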
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z) - Leveraging Print Debugging to Improve Code Generation in Large Language Models [63.63160583432348]
Large language models (LLMs) have made significant progress in code generation tasks.
But their performance in tackling programming problems with complex data structures and algorithms remains suboptimal.
We propose an in-context learning approach that guides LLMs to debug by using a "print debug" method.
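A rough sketch of a print-debugging loop in this spirit, again assuming a generic `llm_complete` callable and sandboxed execution of candidate code; the prompts are illustrative rather than the paper's exact in-context examples:

```python
import subprocess
import tempfile

def run_python(code: str, timeout: int = 10) -> str:
    """Execute a candidate solution and capture its stdout/stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr

def print_debug_loop(llm_complete, problem: str, failing_test: str, rounds: int = 3) -> str:
    code = llm_complete(f"Solve this programming problem in Python:\n{problem}")
    for _ in range(rounds):
        instrumented = llm_complete(
            "Insert print statements into this code to expose intermediate values "
            "relevant to the failing test below, then return the full code.\n\n"
            f"Code:\n{code}\n\nFailing test:\n{failing_test}"
        )
        trace = run_python(instrumented + "\n" + failing_test)
        code = llm_complete(
            f"Given this print-debug trace:\n{trace}\n\nFix the original code:\n{code}"
        )
    return code
```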
arXiv Detail & Related papers (2024-01-10T18:37:59Z) - Progressing from Anomaly Detection to Automated Log Labeling and Pioneering Root Cause Analysis [53.24804865821692]
This study introduces a taxonomy for log anomalies and explores automated data labeling to mitigate labeling challenges.
The study envisions a future where root cause analysis follows anomaly detection, unraveling the underlying triggers of anomalies.
arXiv Detail & Related papers (2023-12-22T15:04:20Z) - A Survey on Automated Software Vulnerability Detection Using Machine Learning and Deep Learning [19.163031235081565]
Machine Learning (ML) and Deep Learning (DL) based models for detecting vulnerabilities in source code have been presented in recent years.
It may be difficult to discover gaps in existing research and potential for future improvement without a comprehensive survey.
This work addresses that gap by presenting a systematic survey to characterize various features of ML/DL-based source-code-level software vulnerability detection approaches.
arXiv Detail & Related papers (2023-06-20T16:51:59Z) - On building machine learning pipelines for Android malware detection: a procedural survey of practices, challenges and opportunities [4.8460847676785175]
As the smartphone market leader, Android has been a prominent target for malware attacks.
For market holders and researchers in particular, the sheer number of samples has made manual malware detection infeasible.
While some of the proposed approaches achieve high performance, rapidly evolving Android malware has made them unable to maintain their accuracy over time.
arXiv Detail & Related papers (2023-06-12T13:52:28Z) - Revelio: ML-Generated Debugging Queries for Distributed Systems [0.0]
Revelio takes user reports and system logs as input, and outputs queries that developers can use to find a bug's root cause.
It employs deep neural networks to uniformly embed diverse input sources and potential queries into a high-dimensional vector space.
We show that Revelio includes the most helpful query in its predicted list of top-3 relevant queries 96% of the time.
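The shared-embedding idea can be sketched as follows: embed user reports, system logs, and candidate debugging queries into one vector space and rank queries by similarity to the combined context. The encoder below is a random-projection placeholder, not Revelio's trained networks:

```python
import numpy as np

def embed(text: str, dim: int = 128) -> np.ndarray:
    """Placeholder encoder; Revelio trains deep networks for this step."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def top_k_queries(user_report: str, system_logs: str, candidate_queries: list[str], k: int = 3):
    # Combine the embedded report and logs into one context vector,
    # then score each candidate query by cosine similarity.
    context = embed(user_report) + embed(system_logs)
    context /= np.linalg.norm(context)
    scores = [(float(context @ embed(q)), q) for q in candidate_queries]
    return [q for _, q in sorted(scores, reverse=True)[:k]]

print(top_k_queries(
    "checkout requests time out under load",
    "db connection pool exhausted at 12:03",
    ["trace slow SQL queries", "grep OOM killer events", "histogram of pool wait times"],
))
```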
arXiv Detail & Related papers (2021-06-28T00:23:21Z) - Understanding the Usability Challenges of Machine Learning In High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains.
In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions.
We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z) - Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise [21.491392581672198]
We present Snoopy, with the goal of supporting data scientists and machine learning engineers performing a systematic and theoretically founded feasibility study.
We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER).
We demonstrate in end-to-end experiments how users are able to save substantial labeling time and monetary efforts.
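One common way to bracket the Bayes error rate is via the 1-nearest-neighbor error and the Cover-Hart bound; a minimal sketch of that estimator (one of several possible BER estimators, not necessarily the one Snoopy uses):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Estimate the 1-NN error; by the Cover-Hart bound the Bayes error rate (BER)
# lies roughly between half the asymptotic 1-NN error and the 1-NN error itself.
X, y = load_digits(return_X_y=True)
nn_error = 1.0 - cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5).mean()

print(f"1-NN error estimate: {nn_error:.3f}")
print(f"Rough BER range: [{nn_error / 2:.3f}, {nn_error:.3f}]")
```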
arXiv Detail & Related papers (2020-10-16T14:21:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.