EnseSmells: Deep ensemble and programming language models for automated code smells detection
- URL: http://arxiv.org/abs/2502.05012v1
- Date: Fri, 07 Feb 2025 15:35:19 GMT
- Title: EnseSmells: Deep ensemble and programming language models for automated code smells detection
- Authors: Anh Ho, Anh M. T. Bui, Phuong T. Nguyen, Amleto Di Salle, Bach Le
- Abstract summary: A smell in software source code denotes an indication of suboptimal design and implementation decisions.
This paper proposes a novel approach to code smell detection, constructing a deep learning architecture that places importance on the fusion of structural features and statistical semantics.
- Score: 3.974095344344234
- Abstract: A smell in software source code indicates suboptimal design and implementation decisions, potentially hindering code understanding and, in turn, raising the likelihood of changes and faults. Identifying these issues early in the software development process can mitigate such problems and enhance the overall quality of the software. Current research primarily focuses on deep learning-based models that mine the contextual information concealed within source code instructions to detect code smells, with limited attention given to structural and design-related features. This paper proposes a novel approach to code smell detection, constructing a deep learning architecture centered on the fusion of structural features and statistical semantics derived from pre-trained models for programming languages. We further provide a thorough analysis of how different source code embedding models affect detection performance across code smell types. On four widely-used code smells from well-designed datasets, our empirical study shows that incorporating design-related features significantly improves detection accuracy, outperforming state-of-the-art methods on the MLCQ dataset with improvements ranging from 5.98% to 28.26%, depending on the type of code smell.
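As a concrete illustration of the fusion idea, the sketch below concatenates a semantic embedding from a pre-trained code model with a vector of structural metrics before classification. This is a minimal sketch of the general pattern, not the EnseSmells architecture itself; the embedding dimension, metric count, and layer sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class FusionSmellDetector(nn.Module):
    """Binary code smell classifier fusing two feature sources:
    a semantic embedding from a pre-trained code model and a
    vector of structural/design metrics (e.g. LOC, WMC, CBO)."""

    def __init__(self, embed_dim=768, n_metrics=12, hidden=128):
        super().__init__()
        # Project each modality before fusion so neither dominates.
        self.sem_proj = nn.Linear(embed_dim, hidden)
        self.struct_proj = nn.Linear(n_metrics, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 1),  # logit: smelly vs. clean
        )

    def forward(self, sem_embedding, struct_metrics):
        fused = torch.cat(
            [self.sem_proj(sem_embedding), self.struct_proj(struct_metrics)],
            dim=-1,
        )
        return self.classifier(fused)

# Example: a batch of 4 code fragments.
model = FusionSmellDetector()
sem = torch.randn(4, 768)     # e.g. [CLS] embeddings from a code model
metrics = torch.randn(4, 12)  # standardized structural metrics
logits = model(sem, metrics)
```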
Related papers
- How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study [45.126233498200534]
We introduce CodeSmellEval, a benchmark designed to evaluate the propensity of Large Language Models for generating code smells.
Our benchmark includes a novel metric: Propensity Smelly Score (PSC), and a curated dataset of method-level code smells: CodeSmellData.
To demonstrate the use of CodeSmellEval, we conducted a case study with two state-of-the-art LLMs, CodeLlama and Mistral.
arXiv Detail & Related papers (2024-12-25T21:56:35Z)
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- Toward Exploring the Code Understanding Capabilities of Pre-trained Code Generation Models [12.959392500354223]
We pioneer the transfer of knowledge from pre-trained code generation models to code understanding tasks.
We introduce CL4D, a contrastive learning method designed to enhance the representation capabilities of decoder-only models.
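The abstract does not detail CL4D's objective; contrastive methods in this family commonly optimize an InfoNCE-style loss over two views of the same code fragment. The sketch below shows that generic loss under those assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Generic InfoNCE contrastive loss: pull each anchor toward its
    positive pair, push it away from every other sample in the batch."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(a.size(0))     # diagonal = matching pairs
    return F.cross_entropy(logits, targets)

# Example: representations of the same function from two views
# (e.g. original code vs. a semantics-preserving transformation).
z1, z2 = torch.randn(8, 256), torch.randn(8, 256)
loss = info_nce(z1, z2)
```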
arXiv Detail & Related papers (2024-06-18T06:52:14Z)
- Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach enriches the representation and processing of source code, thereby improving task performance.
arXiv Detail & Related papers (2024-02-14T20:17:04Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable improves the system's code generation performance.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user.
We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
arXiv Detail & Related papers (2023-03-31T18:03:53Z)
- An Empirical Study on Predictability of Software Code Smell Using Deep Learning Models [3.2973778921083357]
A code smell is a surface indication of a deeper problem in software writing practices.
Recent studies have observed that code containing smells is more likely to change during the software development cycle.
We developed prediction models, based on features extracted from source code, to predict eight types of code smell.
arXiv Detail & Related papers (2021-08-08T12:36:23Z)
- Empirical Analysis on Effectiveness of NLP Methods for Predicting Code Smell [3.2973778921083357]
A code smell is a surface indicator of an inherent problem in the system.
We apply three extreme learning machine (ELM) kernels over 629 packages to identify eight code smells.
Our findings indicate that the radial basis function kernel performs best of the three, with a mean accuracy of 98.52%.
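For orientation, a kernel ELM has a closed-form solution for its output weights, so no iterative training is required. The sketch below is a generic RBF-kernel formulation following the standard kernel-ELM equations; the feature dimensions, labels, and hyperparameters are placeholders, not the study's configuration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Radial basis function kernel matrix between row-vector sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

class KernelELM:
    """Kernel extreme learning machine: output weights are obtained in
    closed form as beta = (K + I/C)^-1 T."""

    def __init__(self, C=1.0, gamma=0.1):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X = X
        T = np.eye(y.max() + 1)[y]  # one-hot targets
        K = rbf_kernel(X, X, self.gamma)
        self.beta = np.linalg.solve(K + np.eye(len(X)) / self.C, T)
        return self

    def predict(self, Xnew):
        return rbf_kernel(Xnew, self.X, self.gamma) @ self.beta

# Example: metric vectors labeled smelly (1) or clean (0).
X = np.random.rand(100, 20)   # stand-ins for package-level metrics
y = (X[:, 0] > 0.5).astype(int)
clf = KernelELM().fit(X, y)
pred = clf.predict(X).argmax(axis=1)
```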
arXiv Detail & Related papers (2021-08-08T12:10:20Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
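A checkpoint of GraphCodeBERT is published on the Hugging Face hub, so extracting fragment embeddings takes only a few lines, as sketched below. Note that this plain-token usage omits the explicit data-flow input used during pre-training, which requires additional preprocessing.

```python
from transformers import AutoModel, AutoTokenizer

# Load the publicly released GraphCodeBERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True)
outputs = model(**inputs)
# The first token's hidden state is a common choice of fragment embedding.
embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```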
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
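One common way to model pairwise relationships between tokens is self-attention with a relative-position bias; the single-head sketch below illustrates that mechanism in generic form and is not the paper's exact model.

```python
import torch
import torch.nn as nn

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention with a learned relative-position bias,
    one way to model pairwise relationships between code tokens."""

    def __init__(self, dim=64, max_rel=32):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        # One learnable bias per clipped relative distance.
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_rel + 1))
        self.max_rel = max_rel
        self.scale = dim ** -0.5

    def forward(self, x):                  # x: (seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.t()) * self.scale  # pairwise token scores
        pos = torch.arange(x.size(0))
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel, self.max_rel)
        scores = scores + self.rel_bias[rel + self.max_rel]
        return torch.softmax(scores, dim=-1) @ v

tokens = torch.randn(10, 64)  # embeddings of 10 code tokens
out = RelativeSelfAttention()(tokens)
```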
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.