Defect Prediction with Content-based Features
- URL: http://arxiv.org/abs/2409.18365v1
- Date: Fri, 27 Sep 2024 00:49:27 GMT
- Title: Defect Prediction with Content-based Features
- Authors: Hung Viet Pham, Tung Thanh Nguyen
- Abstract summary: Traditional defect prediction approaches often use metrics that measure the complexity of the design or implementation of a software system.
In this paper, we explore a different approach based on the content of source code.
- Score: 3.765563438775143
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional defect prediction approaches often use metrics that measure the complexity of the design or implementation of a software system, such as the number of lines of code in a source file. In this paper, we explore a different approach based on the content of source code. Our key assumption is that the source code of a software system contains information about its technical aspects, and that those aspects may have different levels of defect-proneness. Thus, content-based features such as words, topics, data types, and package names extracted from a source code file could be used to predict its defects. We have performed an extensive empirical evaluation and found that: i) such content-based features have higher predictive power than code complexity metrics, and ii) the use of feature selection, reduction, and combination further improves the prediction performance.
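As a rough illustration (not the authors' implementation), the sketch below shows how content-based features such as words, declared data types, and package names could be pulled from source text and fed to a defect classifier; the regexes, the two tiny Java-like snippets, and their labels are hypothetical stand-ins, and scikit-learn is an assumed dependency.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled corpus: raw Java-like source text paired with a
# defect label (1 = file had a post-release defect, 0 = clean).
sources = [
    "package com.app.net; import java.net.Socket; class Client { int retries; }",
    "package com.app.ui; import javax.swing.JFrame; class Window { String title; }",
]
labels = [1, 0]

def content_tokens(code):
    """Extract simple content-based features: identifiers/words, declared
    data types, and package names (rough proxies for the paper's features)."""
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    packages = re.findall(r"package\s+([\w.]+)", code)
    types = re.findall(r"\b(?:int|long|float|double|boolean|String)\b", code)
    return words + ["PKG:" + p for p in packages] + ["TYPE:" + t for t in types]

# Bag-of-features representation fed into a plain linear classifier.
model = make_pipeline(
    CountVectorizer(analyzer=content_tokens),
    LogisticRegression(max_iter=1000),
)
model.fit(sources, labels)
print(model.predict(sources))
```

Feature selection or reduction steps (for example a chi-squared selector or a topic model) could be slotted between the vectorizer and the classifier, which is where the abstract reports further gains from selection, reduction, and combination.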
Related papers
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
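For illustration only, here is a hedged sketch of what folding a data-construction step (a filtering decision) into an instruction-style code comprehension sample might look like; the field names and the helper `filtering_decision_to_sample` are invented, not AlchemistCoder's actual format.

```python
# Hypothetical sketch: turn a corpus-filtering decision into an instruction-style
# "code comprehension" fine-tuning sample, in the spirit of the summary above.
def filtering_decision_to_sample(snippet, keep, reason):
    return {
        "instruction": "Decide whether this code snippet should be kept in a "
                       "fine-tuning corpus and briefly justify the decision.",
        "input": snippet,
        "output": f"{'Keep' if keep else 'Discard'}: {reason}",
    }

sample = filtering_decision_to_sample(
    "def add(a, b):\n    return a + b",
    keep=True,
    reason="syntactically valid, self-contained, and easy to verify.",
)
print(sample["output"])
```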
arXiv Detail & Related papers (2024-05-29T16:57:33Z)
- Towards Understanding the Impact of Code Modifications on Software Quality Metrics [1.2277343096128712]
This study aims to assess and interpret the impact of code modifications on software quality metrics.
The underlying hypothesis posits that code modifications inducing similar changes in software quality metrics can be grouped into distinct clusters.
The results reveal distinct clusters of code modifications, each accompanied by a concise description of its collective impact on software quality metrics.
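As a loose illustration of that hypothesis, the sketch below clusters code modifications by the metric deltas they induce; the metrics, values, and choice of k are hypothetical, and scikit-learn is an assumed dependency.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: each row is one code modification described by the change it
# induces in a few quality metrics (delta cyclomatic complexity, delta coupling,
# delta test coverage); the numbers are made up.
metric_deltas = np.array([
    [ 5.0,  2.0, -0.10],
    [ 4.0,  3.0, -0.08],
    [-3.0, -1.0,  0.12],
    [-4.0, -2.0,  0.15],
])

# Group modifications with similar metric impact into clusters.
X = StandardScaler().fit_transform(metric_deltas)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)  # e.g., [0 0 1 1]: complexity-increasing vs. complexity-reducing changes
```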
arXiv Detail & Related papers (2024-04-05T08:41:18Z)
- Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach enriches the representation and processing of source code, thereby improving task performance.
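A minimal sketch of the augmentation idea, with a random vector standing in for an ASTNN embedding and invented context features derived from bug reports and design-pattern annotations:

```python
import numpy as np

# Hypothetical sketch: a file's learned code embedding (random stand-in for an
# ASTNN vector) is concatenated with simple context features before being passed
# to a downstream classifier.
rng = np.random.default_rng(0)
astnn_embedding = rng.normal(size=128)          # stand-in for the ASTNN output

context_features = np.array([
    3.0,   # number of bug reports linked to this file (hypothetical)
    1.0,   # file participates in an Observer pattern (1 = yes)
    0.0,   # file participates in a Singleton pattern
])

augmented = np.concatenate([astnn_embedding, context_features])
print(augmented.shape)  # (131,) -- richer representation fed to the task model
```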
arXiv Detail & Related papers (2024-02-14T20:17:04Z)
- BAFLineDP: Code Bilinear Attention Fusion Framework for Line-Level Defect Prediction [0.0]
This paper presents a line-level defect prediction method grounded in a code bilinear attention fusion framework (BAFLineDP)
Our results demonstrate that BAFLineDP outperforms current advanced file-level and line-level defect prediction approaches.
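The paper's exact architecture is not reproduced here; the following is a toy PyTorch sketch of one way a bilinear attention fusion of line-level and file-level representations could look, with arbitrary dimensions.

```python
import torch
import torch.nn as nn

class BilinearAttentionFusion(nn.Module):
    """Toy bilinear attention fusion: scores each line embedding against a
    file-level context vector and pools the lines by those scores."""
    def __init__(self, line_dim, file_dim):
        super().__init__()
        self.bilinear = nn.Bilinear(line_dim, file_dim, 1)

    def forward(self, line_embs, file_emb):
        # line_embs: (num_lines, line_dim); file_emb: (file_dim,)
        file_rep = file_emb.repeat(line_embs.size(0), 1)
        scores = self.bilinear(line_embs, file_rep).squeeze(-1)   # (num_lines,)
        weights = torch.softmax(scores, dim=0)                    # per-line attention
        fused = (weights.unsqueeze(-1) * line_embs).sum(dim=0)    # (line_dim,)
        return fused, weights

fusion = BilinearAttentionFusion(line_dim=64, file_dim=32)
fused, weights = fusion(torch.randn(10, 64), torch.randn(32))
print(weights.argmax().item())  # index of the line the model attends to most
```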
arXiv Detail & Related papers (2024-02-11T09:01:42Z)
- Encoding Version History Context for Better Code Representation [13.045078976464307]
This paper presents preliminary evidence of the potential benefit of encoding contextual information from the version history to predict code clones and perform code classification.
To ensure the technique performs consistently, we need to conduct a holistic investigation on a larger code base using different combinations of contexts, aggregation, and models.
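As a hedged sketch of encoding version-history context, the snippet below mixes the current file embedding with an exponentially decayed average of earlier snapshots; the embeddings and decay factor are made up.

```python
import numpy as np

# Hypothetical sketch: earlier snapshots of the same file contribute context with
# exponentially decaying weight; all embeddings here are random stand-ins.
rng = np.random.default_rng(1)
current = rng.normal(size=64)                      # embedding of the latest version
history = [rng.normal(size=64) for _ in range(3)]  # older versions, newest first

decay = 0.5
weights = np.array([decay ** (i + 1) for i in range(len(history))])
history_context = np.average(history, axis=0, weights=weights)

# One simple aggregation: concatenate the current embedding with its history context.
representation = np.concatenate([current, history_context])
print(representation.shape)  # (128,)
```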
arXiv Detail & Related papers (2024-02-06T07:35:36Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
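Not SparseCoder itself, but a small sketch of the identifier-aware sparsity idea: each token attends to a local window plus every identifier token, so the attention mask stays sparse on long files. The token list and window size are invented.

```python
import numpy as np

# Hypothetical sparse attention mask: local sliding window plus global links to
# identifier tokens, instead of the full quadratic token-by-token matrix.
tokens = ["def", "total_price", "(", "items", ")", ":", "return", "sum", "(", "items", ")"]
is_identifier = [t.isidentifier() and t not in {"def", "return"} for t in tokens]

n, window = len(tokens), 2
mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    lo, hi = max(0, i - window), min(n, i + window + 1)
    mask[i, lo:hi] = True            # local sliding window
    mask[i, is_identifier] = True    # global attention to identifiers

print(mask.sum(), "of", n * n, "attention links kept")
```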
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Investigating the Impact of Vocabulary Difficulty and Code Naturalness on Program Comprehension [3.35803394416914]
This study aims to assess readability and understandability from the perspective of language acquisition.
We will conduct a statistical analysis to understand their correlations and analyze whether code naturalness and vocabulary difficulty can be used to improve the performance of readability and understandability prediction methods.
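A hedged sketch of the kind of analysis described: a crude naturalness score (unigram cross-entropy against a tiny reference corpus) correlated with invented comprehension ratings via Spearman's rho. SciPy is an assumed dependency, and a real study would use far larger samples and a proper language model.

```python
import math
from collections import Counter
from scipy.stats import spearmanr

# Tiny hypothetical reference corpus used to estimate token frequencies.
corpus = "return sum of items in list return count of items".split()
unigram = Counter(corpus)
total = sum(unigram.values())

def cross_entropy(tokens):
    # Laplace-smoothed unigram cross-entropy; higher means less "natural" code.
    vocab = len(unigram) + 1
    return -sum(math.log2((unigram[t] + 1) / (total + vocab)) for t in tokens) / len(tokens)

snippets = [
    ["return", "sum", "of", "items"],
    ["return", "count", "of", "items"],
    ["sum", "items", "in", "list"],
    ["frobnicate", "quux", "zorp", "blarg"],
]
naturalness = [cross_entropy(s) for s in snippets]
comprehension = [0.9, 0.85, 0.8, 0.3]   # hypothetical human comprehension ratings

rho, p_value = spearmanr(naturalness, comprehension)
print(rho)  # expect a negative rank correlation: less natural code, lower comprehension
```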
arXiv Detail & Related papers (2023-08-25T15:15:00Z)
- Language Model Decoding as Likelihood-Utility Alignment [54.70547032876017]
We introduce a taxonomy that groups decoding strategies based on their implicit assumptions about how well the model's likelihood is aligned with the task-specific notion of utility.
Specifically, by analyzing the correlation between the likelihood and the utility of predictions across a diverse set of tasks, we provide the first empirical evidence supporting the proposed taxonomy.
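As an illustration of that alignment analysis, the sketch below computes a per-task rank correlation between sequence likelihood and task utility over invented numbers; pandas and SciPy are assumed dependencies.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical predictions: for each task, a few candidate outputs with their
# model likelihood (log-prob) and task-specific utility; all numbers are invented.
df = pd.DataFrame({
    "task":       ["sum", "sum", "sum", "mt", "mt", "mt"],
    "likelihood": [-1.2, -2.5, -4.0, -0.8, -1.1, -3.0],
    "utility":    [0.90, 0.60, 0.20, 0.40, 0.85, 0.30],
})

# Per-task Spearman correlation between likelihood and utility.
alignment = df.groupby("task").apply(
    lambda g: spearmanr(g["likelihood"], g["utility"])[0]
)
print(alignment)
# A task where likelihood tracks utility suits likelihood-greedy decoding;
# a misaligned task motivates utility-aware decoding strategies.
```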
arXiv Detail & Related papers (2022-10-13T17:55:51Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
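The paper's specific losses and augmentations are not reproduced here; below is a generic InfoNCE-style contrastive sketch in PyTorch, in which matched query and code embeddings in a batch are pulled together and mismatched ones pushed apart, using random stand-in embeddings.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of paired embeddings: row i of code_emb matches query i.
batch, dim, temperature = 8, 128, 0.07
query_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # natural-language queries
code_emb = F.normalize(torch.randn(batch, dim), dim=-1)    # code snippets

logits = query_emb @ code_emb.t() / temperature            # (batch, batch) similarity matrix
targets = torch.arange(batch)                              # the diagonal holds the true pairs
loss = F.cross_entropy(logits, targets)                    # InfoNCE-style contrastive loss
print(loss.item())
```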
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that although the approach is simple, it outperforms state-of-the-art techniques by a significant margin.
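One common way to model pairwise relationships between code tokens is a relative-position bias added to the attention scores; the sketch below shows that mechanism in isolation with arbitrary sizes, and is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: dot-product attention plus a learned bias indexed by the
# clipped relative distance between token positions.
seq_len, dim, max_dist = 6, 32, 4
q = torch.randn(seq_len, dim)
k = torch.randn(seq_len, dim)
rel_bias = torch.nn.Embedding(2 * max_dist + 1, 1)   # one bias per clipped distance

pos = torch.arange(seq_len)
dist = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist) + max_dist
scores = q @ k.t() / dim ** 0.5 + rel_bias(dist).squeeze(-1)   # (seq_len, seq_len)
attn = F.softmax(scores, dim=-1)
print(attn.shape)
```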
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.