Fugu-MT Paper Translation (Abstract): SACS: A Code Smell Dataset using Semi-automatic Generation Approach

Paper overview: SACS: A Code Smell Dataset using Semi-automatic Generation Approach

arxiv url: http://arxiv.org/abs/2602.15342v1
Date: Tue, 17 Feb 2026 04:15:22 GMT
Status: Translation complete
Last updated in system: 2026-02-18 16:03:17.977278
Title: SACS: A Code Smell Dataset using Semi-automatic Generation Approach
Authors: Hanyu Zhang, Tomoji Kishi
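Abstract summary: Code smells are a major challenge in software, indicating latent design or implementation flaws. One of the biggest challenges in applying machine learning techniques is the lack of high-quality code smell datasets. This study explores a semi-automatic approach to generating a code smell dataset with high-quality data samples.
Reference score (attention score computed independently by this site): 7.718926822172738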
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Code smells are a major challenge in software refactoring: they indicate latent design or implementation flaws that may degrade software maintainability and evolution. Over the past decades, research on code smells has received extensive attention, and studies applying machine learning techniques have become especially popular in recent years. However, one of the biggest challenges in applying machine learning techniques is the lack of high-quality code smell datasets. Manually constructing such datasets is extremely labor-intensive, as identifying code smells requires substantial development expertise and considerable time investment. In contrast, automatically generated datasets, while scalable, frequently exhibit reduced label reliability and compromised data quality. To overcome this challenge, in this study we explore a semi-automatic approach to generating a code smell dataset with high-quality data samples. Specifically, we first applied a set of automatic generation rules to produce candidate smelly samples. We then employed multiple metrics to group the data samples into an automatically accepted group and a manually reviewed group, enabling reviewers to concentrate their efforts on ambiguous samples. Furthermore, we established structured review guidelines and developed an annotation tool to support the manual validation process. Based on the proposed semi-automatic generation approach, we created an open-source code smell dataset, SACS, covering three widely studied code smells: Long Method, Large Class, and Feature Envy. Each code smell category includes over 10,000 labeled samples. This dataset could provide a large-scale and publicly available benchmark to facilitate future studies on code smell detection and automated refactoring.
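
The grouping step described in the abstract (splitting candidates into an automatically accepted group and a manually reviewed group) might look roughly like the sketch below. It is only an illustration under stated assumptions: the abstract does not name the metrics, thresholds, or decision rule, so the metric names (`loc`, `cyclomatic_complexity`), the threshold values, the `CLEAR_MARGIN` factor, and the `Candidate` and `triage` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate smelly sample produced by an automatic generation rule."""
    sample_id: str
    metrics: dict[str, float]  # hypothetical metric names, e.g. "loc"


# Hypothetical per-metric thresholds; the abstract does not state the real ones.
THRESHOLDS = {"loc": 50.0, "cyclomatic_complexity": 10.0}
# Hypothetical margin: a metric counts as a clear signal only when it reaches
# its threshold multiplied by this factor.
CLEAR_MARGIN = 1.5


def is_clear(candidate: Candidate) -> bool:
    """True when every metric reaches its threshold times the clear margin."""
    return all(
        candidate.metrics.get(name, 0.0) >= limit * CLEAR_MARGIN
        for name, limit in THRESHOLDS.items()
    )


def triage(candidates):
    """Split candidates into (auto_accepted, manual_review) lists."""
    auto_accepted, manual_review = [], []
    for candidate in candidates:
        target = auto_accepted if is_clear(candidate) else manual_review
        target.append(candidate)
    return auto_accepted, manual_review


# Made-up example values, not data from SACS.
samples = [
    Candidate("m1", {"loc": 120, "cyclomatic_complexity": 22}),
    Candidate("m2", {"loc": 80, "cyclomatic_complexity": 11}),
    Candidate("m3", {"loc": 55, "cyclomatic_complexity": 30}),
]
accepted, review = triage(samples)
```

In this sketch only samples on which all metrics agree by a wide margin skip review; anything borderline on even one metric is routed to a human, which mirrors the abstract's goal of concentrating reviewer effort on ambiguous samples.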
