Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3
- URL: http://arxiv.org/abs/2504.16027v1
- Date: Tue, 22 Apr 2025 16:44:39 GMT
- Title: Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3
- Authors: Ahmed R. Sadik, Siddhata Govind
- Abstract summary: This study introduces a structured methodology and evaluation matrix to tackle this issue. The dataset spans four prominent programming languages: Java, Python, JavaScript, and C++. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Determining the most effective Large Language Model for code smell detection presents a complex challenge. This study introduces a structured methodology and evaluation matrix to tackle this issue, leveraging a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages: Java, Python, JavaScript, and C++, allowing for cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and individual code smell type performance. Additionally, we explore cost-effectiveness by comparing the token-based detection approach of GPT-4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer valuable guidance for practitioners in selecting an efficient, cost-effective solution for automated code smell detection.
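For concreteness, the benchmark's scoring reduces to set comparison between the smells an LLM reports for a sample and the annotated ground truth. Below is a minimal sketch of that computation; the sample structure and field names are illustrative assumptions, not the paper's actual data format.

```python
from collections import defaultdict

def prf1(tp, fp, fn):
    """Precision, recall, and F1 from raw true/false positive/negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def score_detections(samples):
    """Aggregate per-language scores over annotated samples.

    Each sample is a dict with hypothetical keys: 'language',
    'true_smells' (set of annotated smells), and 'predicted_smells'
    (set of smells the LLM reported).
    """
    counts = defaultdict(lambda: [0, 0, 0])  # language -> [tp, fp, fn]
    for s in samples:
        truth, pred = s["true_smells"], s["predicted_smells"]
        counts[s["language"]][0] += len(truth & pred)  # true positives
        counts[s["language"]][1] += len(pred - truth)  # false positives
        counts[s["language"]][2] += len(truth - pred)  # false negatives
    return {lang: prf1(*c) for lang, c in counts.items()}

# One Java sample: the model finds Long Method but misses God Class
# and falsely reports Feature Envy.
report = score_detections([{
    "language": "Java",
    "true_smells": {"Long Method", "God Class"},
    "predicted_smells": {"Long Method", "Feature Envy"},
}])
print(report)  # {'Java': (0.5, 0.5, 0.5)}
```

The same per-language counts can be re-aggregated by smell category or by individual smell type to reproduce the paper's three levels of analysis.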
Related papers
- Towards Automated Detection of Inline Code Comment Smells [2.2134505920972547]
We aim to automatically detect and classify inline code comment smells through machine learning (ML) models and a large language model (LLM).
In parallel, we trained and tested seven different machine learning algorithms on the augmented dataset to compare their classification performance against GPT-4.
The performance of the models, particularly Random Forest, which achieved an overall accuracy of 69 percent, establishes a solid baseline for future research in this domain.
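A baseline of this kind can be assembled in a few lines; the following minimal sketch uses TF-IDF features over raw comment text, with illustrative comments, smell categories, and hyperparameters rather than the study's actual setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical labeled inline comments; the study itself uses an
# augmented, manually annotated dataset and its own smell taxonomy.
comments = [
    "TODO: fix this hack before release",
    "FIXME: this breaks on empty input",
    "increments the counter by one",
    "adds 1 to i",
    "returns the user's account balance",
    "fetches the record from the database",
]
labels = ["task", "task", "redundant", "redundant", "ok", "ok"]

# TF-IDF unigrams/bigrams feeding a Random Forest classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.5, random_state=0, stratify=labels
)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```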
arXiv Detail & Related papers (2025-04-26T15:38:14Z)
- Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting [86.15347226865826]
We design a new end-to-end object-aware lifting approach, named Unified-Lift. We augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms.
arXiv Detail & Related papers (2025-03-18T08:42:23Z)
- Leveraging Large Language Models to Detect npm Malicious Packages [4.479741014073169]
This study empirically examines the effectiveness of Large Language Models (LLMs) in detecting malicious code. We present SocketAI, a code review workflow for detecting malicious code.
arXiv Detail & Related papers (2024-03-18T19:10:12Z)
- LLMDFA: Analyzing Dataflow in Code with Large Language Models [8.92611389987991]
This paper presents LLMDFA, a compilation-free and customizable dataflow analysis framework.
We decompose the problem into several subtasks and introduce a series of novel strategies.
On average, LLMDFA achieves 87.10% precision and 80.77% recall, surpassing existing techniques with F1 score improvements of up to 0.35.
arXiv Detail & Related papers (2024-02-16T15:21:35Z)
- Let's Discover More API Relations: A Large Language Model-based AI Chain for Unsupervised API Relation Inference [19.05884373802318]
We propose utilizing large language models (LLMs) as a neural knowledge base for API relation inference.
This approach leverages the entire Web used to pre-train LLMs as a knowledge base and is insensitive to the context and complexity of input texts.
We achieve an average F1 value of 0.76 under the three datasets, significantly higher than the state-of-the-art method's average F1 value of 0.40.
arXiv Detail & Related papers (2023-11-02T14:25:00Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLM-generated code.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, and re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
- V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection [73.37781484123536]
We introduce a highly performant 3D object detector for point clouds using the DETR framework.
To address the limitations of existing methods, we introduce a novel 3D Vertex Relative Position Encoding (3DV-RPE) method.
We show exceptional results on the challenging ScanNetV2 benchmark.
arXiv Detail & Related papers (2023-08-08T17:14:14Z)
- KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection [48.66703222700795]
We resort to a novel kernel strategy to identify the most informative point clouds to acquire labels.
To accommodate both one-stage (i.e., SECOND) and two-stage detectors, we incorporate the classification entropy tangent to balance the trade-off between detection performance and the total number of bounding boxes selected for annotation.
Our results show that approximately 44% box-level annotation costs and 26% computational time are reduced compared to the state-of-the-art method.
arXiv Detail & Related papers (2023-07-16T04:27:03Z)
- DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection [55.70982767084996]
A critical yet frequently overlooked challenge in the field of deepfake detection is the lack of a standardized, unified, comprehensive benchmark.
We present the first comprehensive benchmark for deepfake detection, called DeepfakeBench, which offers three key contributions.
DeepfakeBench contains 15 state-of-the-art detection methods, 9 deepfake datasets, a series of deepfake detection evaluation protocols and analysis tools, as well as comprehensive evaluations.
arXiv Detail & Related papers (2023-07-04T01:34:41Z)
- On the Importance of Building High-quality Training Datasets for Neural Code Search [15.557818317497397]
We propose a data cleaning framework consisting of two subsequent filters: a rule-based syntactic filter and a model-based semantic filter (a minimal sketch follows this list).
We evaluate the effectiveness of our framework on two widely-used code search models and three manually-annotated code retrieval benchmarks.
arXiv Detail & Related papers (2022-02-14T12:02:41Z)
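Below is a minimal sketch of the two-filter cleaning pipeline from the last entry above; the syntactic rules and the semantic scoring function are illustrative placeholders, not the paper's implementation.

```python
import re
from typing import Callable, Iterable

Pair = tuple[str, str]  # (natural-language comment/query, code snippet)

def syntactic_filter(pairs: Iterable[Pair]) -> list[Pair]:
    """Rule-based stage: drop pairs whose comment is trivially noisy."""
    kept = []
    for comment, code in pairs:
        if len(comment.split()) < 3:  # too short to describe the code
            continue
        if re.search(r"https?://|@param|TODO", comment):  # boilerplate markers
            continue
        kept.append((comment, code))
    return kept

def semantic_filter(pairs: list[Pair],
                    score: Callable[[str, str], float],
                    threshold: float = 0.5) -> list[Pair]:
    """Model-based stage: drop pairs the matching model scores below threshold."""
    return [(c, s) for c, s in pairs if score(c, s) >= threshold]

# Toy scorer: token overlap as a stand-in for a trained matching model.
def overlap_score(comment: str, code: str) -> float:
    c = set(comment.lower().split())
    k = set(re.findall(r"\w+", code.lower()))
    return len(c & k) / max(len(c), 1)

raw = [
    ("return the absolute value of x", "def f(x): return abs(x) if x < 0 else x"),
    ("TODO", "def g(): pass"),
]
clean = semantic_filter(syntactic_filter(raw), overlap_score, threshold=0.2)
print(f"{len(clean)} pair(s) kept")  # 1 pair(s) kept
```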
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.