Code Smells for Machine Learning Applications
- URL: http://arxiv.org/abs/2203.13746v1
- Date: Fri, 25 Mar 2022 16:23:02 GMT
- Title: Code Smells for Machine Learning Applications
- Authors: Haiyin Zhang, Luís Cruz, Arie van Deursen
- Abstract summary: There is a lack of guidelines for code quality in machine learning applications.
This paper proposes and identifies a list of 22 machine learning-specific code smells collected from various sources.
We pinpoint each smell with a description of its context, potential issues in the long run, and proposed solutions.
- Score: 6.759291241573661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The popularity of machine learning has grown rapidly in recent years.
Machine learning techniques have been studied intensively in academia and applied
in industry to create business value. However, there is a lack of
guidelines for code quality in machine learning applications. In particular,
code smells have rarely been studied in this domain. Although machine learning
code is usually only a small part of an overarching system, it often plays a
central role in its core functionality. Ensuring code quality is therefore
essential to avoid issues in the long run. This paper proposes and
identifies a list of 22 machine learning-specific code smells collected from
various sources, including papers, grey literature, GitHub commits, and Stack
Overflow posts. We pinpoint each smell with a description of its context,
potential issues in the long run, and proposed solutions. In addition, we link
them to their respective pipeline stage and the evidence from both academic and
grey literature. The code smell catalog helps data scientists and developers
produce and maintain high-quality machine learning application code.
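As a concrete illustration of the kind of ML-specific smell such a catalog covers, consider equality comparisons against NaN, a classic pitfall in data-preparation code. The sketch below uses plain Python for self-containment; the same pitfall applies verbatim to NumPy and pandas code, where `numpy.isnan` or `pandas.isna` is the fix. The variable names are illustrative, not taken from the paper.

```python
import math

ages = [25.0, float("nan"), 31.0]

# Smell: NaN never compares equal to anything, including itself,
# so an equality check silently misses every missing value.
bad = [a for a in ages if a == float("nan")]  # always empty

# Fix: use a dedicated missing-value check instead
# (math.isnan here; numpy.isnan / pandas.isna for array code).
good = [a for a in ages if math.isnan(a)]
```

Because the smelly version raises no error and simply returns an empty result, it can go unnoticed until a downstream model trains on uncleaned data.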
Related papers
- RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph [63.87660059104077]
We present RepoGraph, a plug-in module that manages a repository-level structure for modern AI software engineering solutions.
RepoGraph substantially boosts the performance of all systems, leading to a new state-of-the-art among open-source frameworks.
arXiv Detail & Related papers (2024-10-03T05:45:26Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit [63.82016263181941]
Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora.
Currently, there is already a thriving research community focusing on code intelligence.
arXiv Detail & Related papers (2023-12-30T17:48:37Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- Deep Learning Based Code Generation Methods: Literature Review [30.17038624027751]
This paper focuses on Code Generation task that aims at generating relevant code fragments according to given natural language descriptions.
In this paper, we systematically review the current work on deep learning-based code generation methods.
arXiv Detail & Related papers (2023-03-02T08:25:42Z)
- Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets [0.0]
The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology.
The original contribution of this study was to examine half of the most significant code advances in the last 50 years.
arXiv Detail & Related papers (2023-01-05T23:17:17Z)
- PyKale: Knowledge-Aware Machine Learning from Multiple Sources in Python [6.276936701568444]
Pykale is a Python library for knowledge-aware machine learning on graphs, images, texts, and videos.
We formulate new green machine learning guidelines based on standard software engineering practices.
We build PyKale on PyTorch and leverage the rich PyTorch ecosystem.
arXiv Detail & Related papers (2021-06-17T18:35:37Z)
- Ten Quick Tips for Deep Learning in Biology [116.78436313026478]
Machine learning is concerned with the development and applications of algorithms that can recognize patterns in data and use them for predictive modeling.
Deep learning has become its own subfield of machine learning.
In the context of biological research, deep learning has been increasingly used to derive novel insights from high-dimensional biological data.
arXiv Detail & Related papers (2021-05-29T21:02:44Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
- The Prevalence of Code Smells in Machine Learning projects [9.722159563454436]
Static code analysis can be used to find potential defects in source code, refactoring opportunities, and violations of common coding standards.
We gathered a dataset of 74 open-source Machine Learning projects, installed their dependencies and ran Pylint on them.
This resulted in a top 20 of all detected code smells, per category.
arXiv Detail & Related papers (2021-03-06T16:01:54Z)
- Smoke Testing for Machine Learning: Simple Tests to Discover Severe Defects [7.081604594416339]
We try to determine generic and simple smoke tests that can be used to assert that basic functions can be executed without crashing.
We were able to find bugs in all three machine learning libraries that we tested and severe bugs in two of the three libraries.
arXiv Detail & Related papers (2020-09-03T08:54:43Z)
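The smoke-testing idea in the last entry can be sketched as follows. This is a minimal illustration, not code from the paper: `train_classifier` is a hypothetical stand-in for a library's training routine, and the smoke test only asserts that it executes on tiny random inputs without crashing and returns output of the expected shape.

```python
import random

def train_classifier(features, labels, epochs=3):
    # Hypothetical stand-in for a library training routine:
    # a trivial per-feature threshold "model".
    threshold = sum(x[0] for x in features) / len(features)
    return {"threshold": threshold, "epochs": epochs}

def smoke_test():
    # A smoke test feeds tiny, randomly generated but plausible data
    # and checks only that basic functions run without crashing.
    random.seed(0)
    features = [[random.random()] for _ in range(10)]
    labels = [random.randint(0, 1) for _ in range(10)]
    model = train_classifier(features, labels)
    # Minimal sanity assertions, not a correctness or accuracy check.
    assert isinstance(model, dict)
    assert "threshold" in model
    return model

model = smoke_test()
```

Such tests are cheap to generalize across libraries, which is how the paper's approach could surface crashes in multiple ML frameworks without library-specific oracles.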
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.