An Empirical Study on Predictability of Software Code Smell Using Deep
Learning Models
- URL: http://arxiv.org/abs/2108.04659v1
- Date: Sun, 8 Aug 2021 12:36:23 GMT
- Title: An Empirical Study on Predictability of Software Code Smell Using Deep
Learning Models
- Authors: Himanshu Gupta, Tanmay G. Kulkarni, Lov Kumar, Lalita Bhanu Murthy
Neti and Aneesh Krishna
- Abstract summary: A code smell is a surface indication of a deeper problem in software writing practices.
Recent studies have observed that code containing smells is more likely to change during the software development cycle.
We developed prediction models from features extracted from source code to predict eight types of code smell.
- Score: 3.2973778921083357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A code smell, much like a bad odor, is a surface indication of
something tainted in software writing practices. It points to a deeper
problem lying within the code, one that experienced software developers with
sound coding practices readily recognize. Recent studies have observed that
code containing smells is more prone to change during the software
development cycle. In this paper, we developed code smell prediction models
using features extracted from source code to predict eight types of code
smell. Our work also applies data sampling techniques to handle the class
imbalance problem and feature selection techniques to find relevant feature
sets. Previous studies used techniques such as Naive Bayes and Random Forest
but had not explored deep learning methods for predicting code smell. A
total of 576 distinct deep learning models were trained using the features
and datasets mentioned above. The study concluded that deep learning models
trained on data resampled with the Synthetic Minority Oversampling Technique
(SMOTE) gave better results in terms of accuracy and AUC, with the accuracy
of some models improving from 88.47% to 96.84%.
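As a rough illustration of the pipeline described above, the sketch below
combines feature selection, SMOTE oversampling, and a small neural
classifier. The data, feature count, and network size are placeholders, not
the paper's actual setup.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# X: source-code metrics per class/method, y: 1 if the smell is present.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))           # placeholder metric matrix
y = (rng.random(1000) < 0.1).astype(int)  # imbalanced smell labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Keep the most discriminative metrics (one of several selection options).
selector = SelectKBest(f_classif, k=20).fit(X_tr, y_tr)
X_tr, X_te = selector.transform(X_tr), selector.transform(X_te)

# 2) Balance the minority (smelly) class with SMOTE, on training data only.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# 3) Train a small feed-forward network and report accuracy / AUC.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X_bal, y_bal)
proba = clf.predict_proba(X_te)[:, 1]
print(accuracy_score(y_te, proba > 0.5), roc_auc_score(y_te, proba))
```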
Related papers
- EnseSmells: Deep ensemble and programming language models for automated code smells detection [3.974095344344234]
A smell in software source code is an indication of suboptimal design and implementation decisions.
This paper proposes a novel approach to code smell detection, constructing a deep learning architecture that emphasizes the fusion of structural features and statistical semantics.
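A hypothetical sketch of that fusion idea: project structural metrics and a
semantic code embedding into a shared space, concatenate, and classify. The
layer sizes, embedding dimension, and overall architecture are assumptions,
not EnseSmells' actual design.

```python
import torch
import torch.nn as nn

class FusionSmellDetector(nn.Module):
    def __init__(self, n_metrics=40, emb_dim=768, hidden=128):
        super().__init__()
        self.metric_net = nn.Sequential(nn.Linear(n_metrics, hidden), nn.ReLU())
        self.semantic_net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 1)  # fused representation -> smell logit

    def forward(self, metrics, embedding):
        fused = torch.cat([self.metric_net(metrics),
                           self.semantic_net(embedding)], dim=-1)
        return self.head(fused)

model = FusionSmellDetector()
logit = model(torch.randn(8, 40), torch.randn(8, 768))  # batch of 8 snippets
```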
arXiv Detail & Related papers (2025-02-07T15:35:19Z)
- ChatGPT Code Detection: Techniques for Uncovering the Source of Code [0.0]
We use advanced classification techniques to differentiate between code written by humans and that generated by ChatGPT.
We employ a new approach that combines powerful embedding features (black-box) with supervised learning algorithms.
We show that untrained humans solve the same task no better than random guessing.
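A minimal sketch of that black-box recipe: embed each snippet, then fit an
ordinary supervised classifier. The toy `embed` helper and the classifier
choice here are illustrative; real features would come from a pre-trained
embedding model, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(snippets):
    # Placeholder embedding: surface counts stand in for the black-box
    # embedding features used in the paper.
    return np.array([[len(s), s.count("\n"), s.count(" ")] for s in snippets], float)

human = ["def add(a, b):\n    return a + b"]
generated = ["def add(a: int, b: int) -> int:\n    \"\"\"Add two integers.\"\"\"\n    return a + b"]

X = embed(human + generated)
y = [0] * len(human) + [1] * len(generated)  # 0 = human, 1 = generated
clf = LogisticRegression().fit(X, y)
```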
arXiv Detail & Related papers (2024-05-24T12:56:18Z)
- Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [66.51005288743153]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Rethinking Negative Pairs in Code Search [56.23857828689406]
We propose a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE.
We analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deriving a more precise estimate of mutual information.
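One plausible reading of that weighting, as a sketch rather than the paper's
exact formulation: standard InfoNCE weights every negative pair equally,
while here each negative's term is scaled by a weight. The temperature `tau`
and the unit weights are placeholders.

```python
import torch
import torch.nn.functional as F

def soft_info_nce(query, pos, negs, neg_weights, tau=0.07):
    """query: (d,), pos: (d,), negs: (n, d), neg_weights: (n,)."""
    q = F.normalize(query, dim=0)
    p = F.normalize(pos, dim=0)
    n = F.normalize(negs, dim=1)
    pos_term = torch.exp(q @ p / tau)
    neg_terms = neg_weights * torch.exp(n @ q / tau)  # weights soften negatives
    return -torch.log(pos_term / (pos_term + neg_terms.sum()))

loss = soft_info_nce(torch.randn(128), torch.randn(128),
                     torch.randn(16, 128), torch.ones(16))
```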
arXiv Detail & Related papers (2023-10-12T06:32:42Z)
- Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
- Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost.
We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning.
We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
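The freezing idea itself is easy to sketch; the model name and cut-off layer
below are arbitrary choices, not Telly's reported configuration.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("microsoft/codebert-base")
FREEZE_BELOW = 8  # freeze embeddings + layers 0..7, fine-tune the rest

for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:FREEZE_BELOW]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the unfrozen parameters participate in fine-tuning.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```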
arXiv Detail & Related papers (2023-04-11T13:34:13Z)
- PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user.
We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
arXiv Detail & Related papers (2023-03-31T18:03:53Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- What do pre-trained code models know about code? [9.60966128833701]
We use diagnostic tasks called probes to investigate pre-trained code models.
BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code, and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow) are investigated.
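A minimal probe, assuming the CodeBERT checkpoint named in the summary:
freeze the encoder, pool one vector per snippet, and train only a linear
classifier on top; probe accuracy then suggests how much of a property the
representation encodes. The toy task and pooling choice are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
enc = AutoModel.from_pretrained("microsoft/codebert-base").eval()

snippets = ["def f(x): return x + 1", "int f(int x) { return x + 1; }"]
labels = [0, 1]  # e.g., a toy "which language?" probing task

with torch.no_grad():  # encoder stays frozen; only the probe is trained
    batch = tok(snippets, padding=True, return_tensors="pt")
    feats = enc(**batch).last_hidden_state[:, 0]  # [CLS]-style pooled vectors

probe = LogisticRegression(max_iter=1000).fit(feats.numpy(), labels)
```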
arXiv Detail & Related papers (2021-08-25T16:20:17Z)
- Empirical Analysis on Effectiveness of NLP Methods for Predicting Code Smell [3.2973778921083357]
A code smell is a surface indicator of an inherent problem in the system.
We use three Extreme Learning Machine kernels over 629 packages to identify eight code smells.
Our findings indicate that the radial basis function kernel performs best of the three, with a mean accuracy of 98.52%.
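For reference, a minimal RBF-kernel extreme learning machine in closed form;
the regularization constant and gamma below are illustrative, not the
settings behind the reported 98.52% figure.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

class KernelELM:
    def __init__(self, C=1.0, gamma=0.1):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X = X
        T = np.eye(y.max() + 1)[y]  # one-hot targets
        K = rbf_kernel(X, X, gamma=self.gamma)
        # Output weights in closed form: beta = (K + I/C)^-1 T
        self.beta = np.linalg.solve(K + np.eye(len(X)) / self.C, T)
        return self

    def predict(self, X_new):
        return rbf_kernel(X_new, self.X, gamma=self.gamma).dot(self.beta).argmax(1)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 20)), rng.integers(0, 2, 100)
pred = KernelELM().fit(X, y).predict(X)
```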
arXiv Detail & Related papers (2021-08-08T12:10:20Z)