The Prevalence of Code Smells in Machine Learning projects
- URL: http://arxiv.org/abs/2103.04146v1
- Date: Sat, 6 Mar 2021 16:01:54 GMT
- Title: The Prevalence of Code Smells in Machine Learning projects
- Authors: Bart van Oort, Luís Cruz, Maurício Aniche, Arie van Deursen
- Abstract summary: Static code analysis can be used to find code smells, i.e., (potential) defects in the source code, refactoring opportunities, and violations of common coding standards.
We gathered a dataset of 74 open-source Machine Learning projects, installed their dependencies and ran Pylint on them.
This resulted in a top 20 of all detected code smells, per category.
- Score: 9.722159563454436
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Artificial Intelligence (AI) and Machine Learning (ML) are pervasive in the
current computer science landscape. Yet, there still exists a lack of software
engineering experience and best practices in this field. One such best
practice, static code analysis, can be used to find code smells, i.e.,
(potential) defects in the source code, refactoring opportunities, and
violations of common coding standards. Our research set out to discover the
most prevalent code smells in ML projects. We gathered a dataset of 74
open-source ML projects, installed their dependencies and ran Pylint on them.
This resulted in a top 20 of all detected code smells, per category. Manual
analysis of these smells mainly showed that code duplication is widespread and
that the PEP8 convention for identifier naming style may not always be
applicable to ML code due to its resemblance with mathematical notation. More
interestingly, however, we found several major obstructions to the
maintainability and reproducibility of ML projects, primarily related to the
dependency management of Python projects. We also found that Pylint cannot
reliably check for correct usage of imported dependencies, including prominent
ML libraries such as PyTorch.
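The workflow described in the abstract (take a project, install its dependencies, run Pylint, and aggregate the reported messages per category) can be sketched in a few lines of Python. The snippet below is a minimal illustration of that idea, not the authors' actual tooling; the repository paths are placeholders and the top-20 cut-off simply mirrors the paper's reporting.

```python
import json
import subprocess
from collections import Counter, defaultdict

# Hypothetical list of locally cloned ML projects to analyse.
PROJECTS = ["repos/project-a", "repos/project-b"]


def pylint_messages(project_dir: str) -> list[dict]:
    """Run Pylint on a project and return its messages as parsed JSON.

    Pylint exits with a non-zero status whenever it emits messages,
    so we deliberately do not use check=True. The --recursive option
    requires a reasonably recent Pylint (>= 2.13).
    """
    result = subprocess.run(
        ["pylint", project_dir, "--output-format=json", "--recursive=y"],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout or "[]")


def top_smells(projects: list[str], n: int = 20) -> dict[str, list[tuple[str, int]]]:
    """Count messages per Pylint category (convention, refactor, warning,
    error) and return the n most frequent message symbols in each."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for project in projects:
        for msg in pylint_messages(project):
            counts[msg["type"]][msg["symbol"]] += 1
    return {category: counter.most_common(n) for category, counter in counts.items()}


if __name__ == "__main__":
    for category, smells in top_smells(PROJECTS).items():
        print(f"== {category} ==")
        for symbol, count in smells:
            print(f"{symbol}: {count}")
```

In practice, the naming-related convention messages this surfaces (e.g. invalid-name for identifiers such as X or y) can be tuned through Pylint's good-names option, which connects to the paper's observation about mathematical notation; the extension-pkg-allow-list and generated-members options are the usual, if imperfect, workarounds when Pylint cannot introspect compiled dependencies such as PyTorch.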
Related papers
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- Detecting Refactoring Commits in Machine Learning Python Projects: A Machine Learning-Based Approach [3.000496428347787]
MLRefScanner identifies commits with both ML-specific and general refactoring operations.
Our study highlights the potential of ML-driven approaches in detecting refactoring across diverse programming languages and technical domains.
arXiv Detail & Related papers (2024-04-09T18:46:56Z)
- Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z)
- Julearn: an easy-to-use library for leakage-free evaluation and inspection of ML models [0.23301643766310373]
We present the rationale behind julearn's design, its core features, and showcase three examples of previously-published research projects.
Julearn aims to simplify the entry into the machine learning world by providing an easy-to-use environment with built-in guards against some of the most common ML pitfalls.
arXiv Detail & Related papers (2023-10-19T08:21:12Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- The Good, the Bad, and the Missing: Neural Code Generation for Machine Learning Tasks [11.837851107416588]
This paper investigates the effectiveness of existing neural code generation models on Machine Learning programming tasks.
We select six state-of-the-art neural code generation models, and evaluate their performance on four widely used ML libraries.
Our empirical study reveals some good, bad, and missing aspects of neural code generation models on ML tasks.
arXiv Detail & Related papers (2023-05-16T00:52:02Z)
- Prevalence of Code Smells in Reinforcement Learning Projects [1.7218973692320518]
Reinforcement Learning (RL) is being increasingly used to learn and adapt application behavior in many domains, including large-scale and safety-critical systems.
With the advent of plug-n-play RL libraries, its applicability has further increased, enabling integration of RL algorithms by users.
We note, however, that the majority of such code is not developed by RL engineers, which, as a consequence, may lead to poor program quality, yielding bugs, suboptimal performance, and maintainability and evolution problems for RL-based projects.
arXiv Detail & Related papers (2023-03-17T20:25:13Z)
- Code Smells for Machine Learning Applications [6.759291241573661]
There is a lack of guidelines for code quality in machine learning applications.
This paper proposes and identifies a list of 22 machine learning-specific code smells collected from various sources.
We pinpoint each smell with a description of its context, potential issues in the long run, and proposed solutions.
arXiv Detail & Related papers (2022-03-25T16:23:02Z)
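To make the notion of an ML-specific code smell concrete, here is a small, hypothetical before/after example in the spirit of such catalogues (uncontrolled randomness harming reproducibility); it is an illustration, not an excerpt from the 22 smells proposed in the paper above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, seeded only so the illustration itself is repeatable.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Smelly version: the split differs on every run, so any reported
# metrics are not reproducible (an "uncontrolled randomness" issue).
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Refactored version: pinning random_state makes the experiment repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```

Pinning the seed (or, for NumPy, using a seeded Generator) is the usual fix, and it is also the kind of issue that generic linters such as Pylint do not flag.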
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.