Software Code Quality Measurement: Implications from Metric
Distributions
- URL: http://arxiv.org/abs/2307.12082v4
- Date: Tue, 16 Jan 2024 11:32:21 GMT
- Title: Software Code Quality Measurement: Implications from Metric
Distributions
- Authors: Siyuan Jin, Mianmian Zhang, Yekai Guo, Yuejiang He, Ziyuan Li, Bichao
Chen, Bing Zhu, and Yong Xia
- Abstract summary: We categorized distinct metrics into two types: 1) monotonic metrics that consistently influence code quality; and 2) non-monotonic metrics that lack a consistent relationship with code quality.
Our work contributes to the multi-dimensional construct of code quality and its metric measurements, which provides practical implications for consistent measurements on both monotonic and non-monotonic metrics.
- Score: 6.110201315596897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software code quality is a construct with three dimensions: maintainability,
reliability, and functionality. Although many firms have incorporated code
quality metrics in their operations, evaluating these metrics still lacks
consistent standards. We categorized distinct metrics into two types: 1)
monotonic metrics that consistently influence code quality; and 2)
non-monotonic metrics that lack a consistent relationship with code quality. To
consistently evaluate them, we proposed a distribution-based method to derive
metric scores. Our empirical analysis includes 36,460 high-quality open-source
software (OSS) repositories and their raw metrics from SonarQube and CK. The
evaluated scores demonstrate strong explanatory power for software adoption. Our
work contributes to the multi-dimensional construct of code quality and its
metric measurements, which provides practical implications for consistent
measurements on both monotonic and non-monotonic metrics.
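The abstract does not spell out the scoring formula, so the following is only a minimal sketch of what a distribution-based scoring of the two metric types could look like: a monotonic metric (assumed here to be "lower is better") is scored by its percentile rank in the corpus distribution, while a non-monotonic metric is scored by its closeness to the corpus mode. The function names, the synthetic corpus, and the percentile-rank and mode-distance schemes are illustrative assumptions, not the authors' method.

```python
import numpy as np

def monotonic_score(value, corpus_values, lower_is_better=True):
    """Percentile-rank score in [0, 1]: the fraction of corpus repositories
    that this repository matches or outperforms on a monotonic metric."""
    corpus = np.asarray(corpus_values, dtype=float)
    if lower_is_better:
        return float(np.mean(corpus >= value))
    return float(np.mean(corpus <= value))

def non_monotonic_score(value, corpus_values, bins=50):
    """Score a non-monotonic metric by closeness to the corpus mode:
    near 1.0 at the most common (presumably healthy) value, falling toward 0
    in the distribution's tails. A hypothetical scheme, not the paper's formula."""
    corpus = np.asarray(corpus_values, dtype=float)
    counts, edges = np.histogram(corpus, bins=bins)
    mode = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])
    spread = float(corpus.max() - corpus.min()) or 1.0
    return float(max(0.0, 1.0 - abs(value - mode) / spread))

# Example: score one repository against a synthetic corpus of 36,460 values.
rng = np.random.default_rng(0)
complexity_corpus = rng.lognormal(mean=3.0, sigma=0.8, size=36_460)   # monotonic: lower is better
comment_density_corpus = rng.beta(2, 5, size=36_460) * 100            # non-monotonic: extremes are suspect

print(monotonic_score(15.0, complexity_corpus, lower_is_better=True))
print(non_monotonic_score(30.0, comment_density_corpus))
```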
Related papers
- Evaluating Source Code Quality with Large Language Models: a comparative study [2.3204178451683264]
This paper describes and analyzes the results obtained when using a Large Language Model (LLM) as a static analysis tool.
A total of 1,641 classes were analyzed, comparing the results of two model versions: GPT 3.5 Turbo and GPT 4o.
The GPT 4o version did not produce the same results, diverging from both the previous model and Sonar by assigning high classifications to code that had been assessed as lower quality.
arXiv Detail & Related papers (2024-08-07T18:44:46Z) - CodeScore-R: An Automated Robustness Metric for Assessing the FunctionalCorrectness of Code Synthesis [17.747095451792084]
We propose an automated robust metric, called CodeScore-R, for evaluating the functionality of code synthesis.
In the tasks of code generation and migration in Java and Python, CodeScore-R outperforms other metrics.
arXiv Detail & Related papers (2024-06-11T02:51:17Z) - Towards Understanding the Impact of Code Modifications on Software Quality Metrics [1.2277343096128712]
This study aims to assess and interpret the impact of code modifications on software quality metrics.
The underlying hypothesis posits that code modifications inducing similar changes in software quality metrics can be grouped into distinct clusters.
The results reveal distinct clusters of code modifications, each accompanied by a concise description of its collective impact on software quality metrics; a minimal sketch of this clustering idea follows at the end of this list.
arXiv Detail & Related papers (2024-04-05T08:41:18Z) - Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z) - Free Open Source Communities Sustainability: Does It Make a Difference
in Software Quality? [2.981092370528753]
This study aims to empirically explore how the different aspects of sustainability impact software quality.
16 sustainability metrics across four categories were sampled and applied to a set of 217 OSS projects.
arXiv Detail & Related papers (2024-02-10T09:37:44Z) - On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z) - The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z) - QAFactEval: Improved QA-Based Factual Consistency Evaluation for
Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z) - Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep
Learning [66.59455427102152]
We introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks.
Each baseline is a self-contained experiment pipeline with easily reusable and extendable components.
We provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results.
arXiv Detail & Related papers (2021-06-07T23:57:32Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)