A systematic literature review on the code smells datasets and validation mechanisms
- URL: http://arxiv.org/abs/2306.01377v1
- Date: Fri, 2 Jun 2023 08:57:31 GMT
- Title: A systematic literature review on the code smells datasets and validation mechanisms
- Authors: Morteza Zakeri-Nasrabadi, Saeed Parsa, Ehsan Esmaili, and Fabio Palomba
- Abstract summary: A survey of 45 existing datasets reveals that the adequacy of a dataset for detecting smells depends heavily on relevant properties. Most existing datasets support God Class, Long Method, and Feature Envy, while six smells in Fowler and Beck's catalog are not supported by any dataset.
- Score: 13.359901661369236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The accuracy reported for code smell detection tools varies depending on the dataset used to evaluate them. Our survey of 45 existing datasets reveals that the adequacy of a dataset for detecting smells depends heavily on properties such as its size, severity levels, project types, the number of each type of smell, the total number of smells, and the ratio of smelly to non-smelly samples. Most existing datasets support God Class, Long Method, and Feature Envy, while six smells in Fowler and Beck's catalog are not supported by any dataset. We conclude that existing datasets suffer from imbalanced samples, lack of severity-level information, and restriction to the Java language.
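The adequacy properties the abstract names (size, per-smell counts, severity support, and the smelly to non-smelly ratio) are straightforward to compute for any candidate dataset. Below is a minimal Python sketch of such a computation; the CSV schema (columns sample_id, smell_type, is_smelly, severity) is a hypothetical layout chosen for illustration, not the format of any dataset surveyed in the paper.

```python
# Minimal sketch: computing the adequacy properties highlighted in the
# abstract for a hypothetical code smell dataset stored as CSV.
# The column names below are assumptions for illustration only.
import csv
from collections import Counter

def dataset_adequacy_stats(path: str) -> dict:
    """Compute size, per-smell counts, severity support, and class ratio."""
    per_smell = Counter()      # number of samples per smell type
    smelly = non_smelly = 0    # counts for the smelly/non-smelly ratio
    has_severity = True        # whether every sample carries a severity level
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["is_smelly"] == "1":   # hypothetical column: 1 = smelly
                smelly += 1
                per_smell[row["smell_type"]] += 1
            else:
                non_smelly += 1
            has_severity = has_severity and bool(row.get("severity"))
    return {
        "size": smelly + non_smelly,
        "per_smell_counts": dict(per_smell),
        "smelly_to_non_smelly_ratio": (smelly / non_smelly) if non_smelly else float("inf"),
        "supports_severity_level": has_severity,
    }

# Example: stats = dataset_adequacy_stats("smells.csv")
```

A ratio far below 1 signals the class imbalance the survey identifies as a common weakness of existing datasets.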
Related papers
- Anno-incomplete Multi-dataset Detection [67.69438032767613]
We propose a novel problem, "Anno-incomplete Multi-dataset Detection".
We develop an end-to-end multi-task learning architecture that accurately detects all object categories using multiple partially annotated datasets.
arXiv Detail & Related papers (2024-08-29T03:58:21Z)
- Bayesian Detector Combination for Object Detection with Crowdsourced Annotations [49.43709660948812]
Acquiring fine-grained object detection annotations in unconstrained images is time-consuming, expensive, and prone to noise.
We propose a novel Bayesian Detector Combination (BDC) framework to more effectively train object detectors with noisy crowdsourced annotations.
BDC is model-agnostic, requires no prior knowledge of the annotators' skill level, and seamlessly integrates with existing object detection models.
arXiv Detail & Related papers (2024-07-10T18:00:54Z)
- Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models [25.893228797735908]
This study focuses on the credibility of real-world datasets, including the popular benchmarks Jigsaw Civil Comments, Anthropic Harmless & Red Team, PKU BeaverTails & SafeRLHF.
Given the cost and difficulty of cleaning these datasets by humans, we introduce a systematic framework for evaluating the credibility of datasets.
We find and fix an average of 6.16% label errors in 11 datasets constructed from the above benchmarks.
arXiv Detail & Related papers (2023-11-19T02:34:12Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), samples chosen in the infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- DACOS-A Manually Annotated Dataset of Code Smells [4.753388560240438]
We present DACOS, a manually annotated dataset containing 10,267 annotations for 5,192 code snippets.
The dataset targets three kinds of code smells at different granularity: multifaceted abstraction, complex method, and long parameter list.
We have developed TagMan, a web application that lets annotators view and mark the snippets one by one and records their annotations (a hypothetical sketch of working with such annotations follows this entry).
arXiv Detail & Related papers (2023-03-15T16:13:40Z)
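As flagged in the entry above, here is a minimal sketch of grouping DACOS-style annotations by smell kind; the Annotation record layout is hypothetical and does not reflect DACOS's actual schema or file format.

```python
# Minimal sketch: grouping annotations of a DACOS-style code smell dataset
# by smell kind. The record layout below is a hypothetical illustration,
# not the actual DACOS schema.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Annotation:
    snippet_id: int
    smell: str        # e.g. "multifaceted abstraction", "complex method", "long parameter list"
    granularity: str  # "class" or "method", depending on the smell
    is_smelly: bool   # the annotator's verdict for this snippet

def group_by_smell(annotations):
    groups = defaultdict(list)
    for a in annotations:
        groups[a.smell].append(a)
    return groups

# Toy data standing in for real annotations.
annotations = [
    Annotation(1, "complex method", "method", True),
    Annotation(2, "long parameter list", "method", False),
    Annotation(3, "multifaceted abstraction", "class", True),
]
for smell, items in group_by_smell(annotations).items():
    smelly = sum(a.is_smelly for a in items)
    print(f"{smell}: {smelly}/{len(items)} smelly")
```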
- Analyzing Fairness in Deepfake Detection With Massively Annotated Databases [9.407035514709293]
We investigate factors causing biased detection in public Deepfake datasets.
We create large-scale demographic and non-demographic annotations with 47 different attributes for five popular Deepfake datasets.
We analyse which attributes lead to AI bias in three state-of-the-art Deepfake detection backbone models on these datasets.
arXiv Detail & Related papers (2022-08-11T14:28:21Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
Detection Hub is a new design that is dataset-aware and category-aligned.
It mitigates dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
Categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embeddings.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.