Fugu-MT Paper Translation (Abstract): An Empirical Study of Gemini 3 for Detecting Natural Language Test Smells in Manual Test Cases

Paper overview: An Empirical Study of Gemini 3 for Detecting Natural Language Test Smells in Manual Test Cases

arxiv url: http://arxiv.org/abs/2606.13804v1
Date: Thu, 11 Jun 2026 18:20:15 GMT
Status: Translation complete
System update date: 2026-06-15 16:00:42.570233
Title: An Empirical Study of Gemini 3 for Detecting Natural Language Test Smells in Manual Test Cases
Authors: Keila Lucas, Rohit Gheyi, Márcio Ribeiro, Fabio Palomba, Luana Martins, Elvys Soares
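Abstract summary: This study investigates whether a contemporary Large Language Model (GEMINI-3-PRO-PREVIEW) can identify test smells in natural language manual test cases. The approach evaluates complete test cases, enabling the model to consider relationships and dependencies among test steps. Test smells are pervasive in practice, with nearly one detected test smell per step on average.
Reference score (independently calculated attention level): 7.100719635469756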
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Manual testing, in which testers follow natural language instructions to validate system behavior, remains essential for uncovering issues that are difficult to capture with automation. However, manual test cases often contain test smells, quality issues such as ambiguity, redundancy, or missing checks that reduce reliability, maintainability, and reproducibility. Existing detection approaches largely depend on manually engineered rules and thus struggle to generalize and scale across heterogeneous test suites. In our previous work, we assessed the feasibility of using Small Language Models (SLMs) for test smell detection by evaluating GEMMA-3-4B, LLAMA-3.2-3B, and PHI-4-14B on test steps from 143 real-world Ubuntu test cases, covering seven smell types. PHI-4-14B achieved the best performance. In this article, we investigate whether a contemporary Large Language Model (GEMINI-3-PRO-PREVIEW) available at the time of the study can identify test smells in natural language manual test cases using a prompt-based, whole-test-case analysis strategy. Unlike approaches that analyze individual test steps in isolation, our approach evaluates complete test cases, enabling the model to consider relationships and dependencies among test steps. We evaluate the approach on 100 Ubuntu test cases covering seven test smell types and compare its performance against previously evaluated SLMs, including GEMMA-3-4B, LLAMA-3.2-3B, and PHI-4-14B. Our results show that GEMINI-3-PRO-PREVIEW outperforms the SLMs, while producing actionable explanations that can help practitioners revise manual test cases for greater clarity and consistency. We also find that test smells are pervasive in practice, with nearly one detected test smell per step on average, highlighting the need for scalable and automated quality support for manual testing artifacts.
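The difference between step-in-isolation analysis and the whole-test-case strategy described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the `ManualTestCase` type, the function names, the prompt wording, and the sample steps are hypothetical and are not taken from the paper, and the smell labels are only the example quality issues the abstract mentions, not the study's seven smell types. No model is called; the sketch only builds the prompt text and computes a smells-per-step ratio.

```python
from dataclasses import dataclass


@dataclass
class ManualTestCase:
    """Hypothetical container for a natural language manual test case."""
    title: str
    steps: list[str]


# Illustrative labels only: the abstract lists these as example quality
# issues; they are NOT the seven smell types evaluated in the study.
EXAMPLE_SMELLS = ["ambiguity", "redundancy", "missing checks"]


def per_step_prompts(case: ManualTestCase) -> list[str]:
    """Build one prompt per step, so each step is seen in isolation."""
    smells = ", ".join(EXAMPLE_SMELLS)
    return [
        f"Identify test smells ({smells}) in this test step:\n{step}"
        for step in case.steps
    ]


def whole_case_prompt(case: ManualTestCase) -> str:
    """Build a single prompt containing every numbered step, so that
    relationships and dependencies among steps are visible at once."""
    smells = ", ".join(EXAMPLE_SMELLS)
    numbered = "\n".join(
        f"{i}. {step}" for i, step in enumerate(case.steps, start=1)
    )
    return (
        f"Identify test smells ({smells}) in this manual test case.\n"
        f"Consider how the steps relate to and depend on each other.\n"
        f"Test case: {case.title}\n{numbered}"
    )


def smells_per_step(detected_smells: int, total_steps: int) -> float:
    """Average number of detected smells per test step."""
    return detected_smells / total_steps
```

With a hypothetical three-step case, `per_step_prompts` yields three independent prompts, while `whole_case_prompt` yields one prompt in which a later step such as "Check it works" can be judged against the steps before it.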

Related papers