Fugu-MT Paper Translation (Abstract): Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines

Paper abstract: Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines

arxiv url: http://arxiv.org/abs/2605.14568v1
Date: Thu, 14 May 2026 08:38:04 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.727575
Title: Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
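Authors: Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal
Abstract summary: Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies.
Reference score (independently computed attention level): 1.9537983097153042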
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Context. Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. Sentence-BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density-Based Clustering (HDBSCAN) recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An eXtreme Gradient Boosting (XGBoost) extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p < 1e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, or cross-organisational shared-step candidate. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring opportunities; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
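The Method step above (every contiguous L-step window, L in [2, 18], keyed by cluster identifiers and counted under three scopes) can be illustrated with a minimal sketch. This is not the authors' released pipeline: the input record layout, the function names `windows` and `candidates`, and the threshold of two occurrences per scope are all assumptions made for illustration, and the integer cluster ids stand in for the SBERT/UMAP/HDBSCAN output.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

# Hypothetical record: (owner, repo, file_path, step cluster ids of one scenario)
Scenario = Tuple[str, str, str, List[int]]
Slice = Tuple[int, ...]


def windows(step_ids: List[int], min_len: int = 2, max_len: int = 18) -> Iterator[Slice]:
    """Yield every contiguous window of min_len..max_len cluster ids."""
    n = len(step_ids)
    for length in range(min_len, min(max_len, n) + 1):
        for start in range(n - length + 1):
            yield tuple(step_ids[start:start + length])


def candidates(scenarios: List[Scenario]) -> Tuple[Set[Slice], Set[Slice], Set[Slice]]:
    """Return slices recurring within a file, within a repo, and across owners.

    Assumption (not from the paper): a slice "recurs" in a scope when it is
    seen in at least two scenarios (file), two files (repo), or two owners.
    """
    per_file: Dict[Tuple[str, str, str, Slice], int] = defaultdict(int)
    files_per_repo: Dict[Tuple[str, str, Slice], Set[str]] = defaultdict(set)
    owners: Dict[Slice, Set[str]] = defaultdict(set)

    for owner, repo, path, ids in scenarios:
        # set() so that a slice repeated inside one scenario counts once
        for key in set(windows(ids)):
            per_file[(owner, repo, path, key)] += 1
            files_per_repo[(owner, repo, key)].add(path)
            owners[key].add(owner)

    within_file = {k[3] for k, count in per_file.items() if count >= 2}
    within_repo = {k[2] for k, paths in files_per_repo.items() if len(paths) >= 2}
    cross_org = {s for s, seen in owners.items() if len(seen) >= 2}
    return within_file, within_repo, cross_org


if __name__ == "__main__":
    toy = [
        ("a", "r1", "f1.feature", [1, 2, 3]),
        ("a", "r1", "f1.feature", [1, 2, 4]),
        ("a", "r1", "f2.feature", [2, 3, 5]),
        ("b", "r2", "g.feature", [9, 2, 3]),
    ]
    print(candidates(toy))
```

Keying on cluster ids rather than raw step text is what makes the count robust to paraphrase: two steps worded differently but assigned to the same cluster produce the same slice key.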
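The abstract reports three-author agreement as Fleiss' kappa (0.56 for extraction-worthy, 0.79 for mechanism). As a reminder of how that statistic is computed, here is a minimal sketch using the standard textbook formula; the function name `fleiss_kappa` and the toy count matrix are illustrative assumptions and do not reproduce the paper's labelled pool or its reported values.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a count matrix.

    ratings[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters (at least 2).
    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Observed agreement: mean over items of the share of agreeing rater pairs
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions
    p_cats = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)

    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy data: 4 items, 3 raters, 2 categories (worthy / not worthy)
    toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
    print(round(fleiss_kappa(toy), 3))
```

With three raters and a binary label, each item contributes either full agreement (3 vs 0) or one-third agreement (2 vs 1), so kappa is sensitive to how many items split 2 to 1.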


Related papers