Code Smells for Machine Learning Applications
- URL: http://arxiv.org/abs/2203.13746v1
- Date: Fri, 25 Mar 2022 16:23:02 GMT
- Title: Code Smells for Machine Learning Applications
- Authors: Haiyin Zhang, Luís Cruz, Arie van Deursen
- Abstract summary: There is a lack of guidelines for code quality in machine learning applications.
This paper proposes and identifies a list of 22 machine learning-specific code smells collected from various sources.
We pinpoint each smell with a description of its context, potential issues in the long run, and proposed solutions.
- Score: 6.759291241573661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The popularity of machine learning has grown rapidly in recent years.
Machine learning techniques have been studied intensively in academia and applied
in industry to create business value. However, there is a lack of
guidelines for code quality in machine learning applications. In particular,
code smells have rarely been studied in this domain. Although machine learning
code is usually only a small part of an overarching system, it often plays a
central role in its core functionality. Ensuring code quality is therefore
essential to avoid issues in the long run. This paper proposes and
identifies a list of 22 machine learning-specific code smells collected from
various sources, including papers, grey literature, GitHub commits, and Stack
Overflow posts. We pinpoint each smell with a description of its context,
potential issues in the long run, and proposed solutions. In addition, we link
them to their respective pipeline stage and the evidence from both academic and
grey literature. The code smell catalog helps data scientists and developers
produce and maintain high-quality machine learning application code.
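As a concrete illustration of the kind of ML-specific smell such a catalog covers, consider equality comparisons against NaN, a classic pitfall in data-preparation code. The sketch below uses plain Python for self-containment; the same pitfall applies verbatim to NumPy and pandas code, where `numpy.isnan` or `pandas.isna` is the fix. The variable names are illustrative, not taken from the paper.

```python
import math

ages = [25.0, float("nan"), 31.0]

# Smell: NaN never compares equal to anything, including itself,
# so an equality check silently misses every missing value.
bad = [a for a in ages if a == float("nan")]  # always empty

# Fix: use a dedicated missing-value check instead
# (math.isnan here; numpy.isnan / pandas.isna for array code).
good = [a for a in ages if math.isnan(a)]
```

Because the smelly version raises no error and simply returns an empty result, it can go unnoticed until a downstream model trains on uncleaned data.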
Related papers
- RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph [63.87660059104077]
We present RepoGraph, a plug-in module that manages a repository-level structure for modern AI software engineering solutions.
RepoGraph substantially boosts the performance of all systems, leading to a new state-of-the-art among open-source frameworks.
arXiv Detail & Related papers (2024-10-03T05:45:26Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit [63.82016263181941]
Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora.
Currently, there is already a thriving research community focusing on code intelligence.
arXiv Detail & Related papers (2023-12-30T17:48:37Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- Deep Learning Based Code Generation Methods: Literature Review [30.17038624027751]
This paper focuses on Code Generation task that aims at generating relevant code fragments according to given natural language descriptions.
In this paper, we systematically review the current work on deep learning-based code generation methods.
arXiv Detail & Related papers (2023-03-02T08:25:42Z)
- Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets [0.0]
The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology.
The original contribution of this study was to examine half of the most significant code advances in the last 50 years.
arXiv Detail & Related papers (2023-01-05T23:17:17Z)
- PyKale: Knowledge-Aware Machine Learning from Multiple Sources in Python [6.276936701568444]
Pykale is a Python library for knowledge-aware machine learning on graphs, images, texts, and videos.
We formulate new green machine learning guidelines based on standard software engineering practices.
We build PyKale on PyTorch and leverage the rich PyTorch ecosystem.
arXiv Detail & Related papers (2021-06-17T18:35:37Z)
- Ten Quick Tips for Deep Learning in Biology [116.78436313026478]
Machine learning is concerned with the development and applications of algorithms that can recognize patterns in data and use them for predictive modeling.
Deep learning has become its own subfield of machine learning.
In the context of biological research, deep learning has been increasingly used to derive novel insights from high-dimensional biological data.
arXiv Detail & Related papers (2021-05-29T21:02:44Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
- The Prevalence of Code Smells in Machine Learning projects [9.722159563454436]
Static code analysis can be used to find potential defects in source code, refactoring opportunities, and violations of common coding standards.
We gathered a dataset of 74 open-source Machine Learning projects, installed their dependencies and ran Pylint on them.
This resulted in a top 20 of all detected code smells, per category.
arXiv Detail & Related papers (2021-03-06T16:01:54Z)
- Smoke Testing for Machine Learning: Simple Tests to Discover Severe Defects [7.081604594416339]
We try to determine generic and simple smoke tests that can be used to assert that basic functions can be executed without crashing.
We were able to find bugs in all three machine learning libraries that we tested and severe bugs in two of the three libraries.
arXiv Detail & Related papers (2020-09-03T08:54:43Z)
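The smoke-testing idea in the last entry can be sketched as follows. This is a minimal illustration, not code from the paper: `train_classifier` is a hypothetical stand-in for a library's training routine, and the smoke test only asserts that it executes on tiny random inputs without crashing and returns output of the expected shape.

```python
import random

def train_classifier(features, labels, epochs=3):
    # Hypothetical stand-in for a library training routine:
    # a trivial per-feature threshold "model".
    threshold = sum(x[0] for x in features) / len(features)
    return {"threshold": threshold, "epochs": epochs}

def smoke_test():
    # A smoke test feeds tiny, randomly generated but plausible data
    # and checks only that basic functions run without crashing.
    random.seed(0)
    features = [[random.random()] for _ in range(10)]
    labels = [random.randint(0, 1) for _ in range(10)]
    model = train_classifier(features, labels)
    # Minimal sanity assertions, not a correctness or accuracy check.
    assert isinstance(model, dict)
    assert "threshold" in model
    return model

model = smoke_test()
```

Such tests are cheap to generalize across libraries, which is how the paper's approach could surface crashes in multiple ML frameworks without library-specific oracles.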
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.