Fugu-MT Paper Translation (Abstract): Automating Detection and Root-Cause Analysis of Flaky Tests in Quantum Software

Paper overview: Automating Detection and Root-Cause Analysis of Flaky Tests in Quantum Software

arxiv url: http://arxiv.org/abs/2603.09029v1
Date: Mon, 09 Mar 2026 23:57:55 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:23.902892
Title: Automating Detection and Root-Cause Analysis of Flaky Tests in Quantum Software
Title (reference translation): Automated Detection and Root-Cause Analysis of Flaky Tests in Quantum Software
Authors: Janakan Sivaloganathan, Ainaz Jamshidi, Andriy Miranskyy, Lei Zhang
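Abstract summary: This paper presents an automated pipeline to detect flaky-test-related issues and pull requests in quantum software repositories. The authors expand an existing quantum flaky test dataset and evaluate the performance of large language models for flakiness classification and root-cause identification. The best-performing model, Google Gemini, achieves an F1-score of 0.9420 for flakiness detection and 0.9643 for root-cause identification.
Reference score (independently computed attention score): 3.853925623717688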
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Like classical software, quantum software systems rely on automated testing. However, their inherently probabilistic outputs make them susceptible to quantum flakiness -- tests that pass or fail inconsistently without code changes. Such quantum flaky tests can mask real defects and reduce developer productivity, yet systematic tooling for their detection and diagnosis remains limited. This paper presents an automated pipeline to detect flaky-test-related issues and pull requests in quantum software repositories and to support the identification of their root causes. We aim to expand an existing quantum flaky test dataset and evaluate the capability of Large Language Models (LLMs) for flakiness classification and root-cause identification. Building on a prior manual analysis of 14 quantum software repositories, we automate the discovery of additional flaky test cases using LLMs and cosine similarity. We further evaluate a variety of LLMs from OpenAI GPT, Meta LLaMA, Google Gemini, and Anthropic Claude suites for classifying flakiness and identifying root causes from issue descriptions and code context. Classification performance is assessed using standard performance metrics, including F1-score. Using our pipeline, we identify 25 previously unknown flaky tests, increasing the original dataset size by 54%. The best-performing model, Google Gemini, achieves an F1-score of 0.9420 for flakiness detection and 0.9643 for root-cause identification, demonstrating that LLMs can provide practical support for triaging flaky reports and understanding their underlying causes in quantum software. The expanded dataset and automated pipeline provide reusable artifacts for the quantum software engineering community. Future work will focus on improving detection robustness and exploring automated repair of quantum flaky tests.
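Abstract (reference translation): As with classical software, quantum software systems depend on automated testing. Their inherently probabilistic outputs, however, expose them to quantum flakiness: tests that pass or fail inconsistently even though the code has not changed. Quantum flaky tests of this kind can hide real defects and lower developer productivity, yet systematic tools for detecting and diagnosing them are still limited. This paper proposes an automated pipeline that detects flaky-test-related issues and pull requests in quantum software repositories and supports the identification of their root causes. The goal is to expand an existing quantum flaky test dataset and to evaluate how well Large Language Models (LLMs) perform flakiness classification and root-cause identification. Building on an earlier manual analysis of 14 quantum software repositories, the discovery of additional flaky test cases is automated using LLMs and cosine similarity. A range of LLMs from the OpenAI GPT, Meta LLaMA, Google Gemini, and Anthropic Claude suites is then evaluated on classifying flakiness and identifying root causes from issue descriptions and code context. Classification performance is assessed with standard performance metrics, including F1-score. The pipeline identifies 25 previously unknown flaky tests, increasing the size of the original dataset by 54%. The best-performing model, Google Gemini, achieves an F1-score of 0.9420 for flakiness detection and 0.9643 for root-cause identification, showing that LLMs can give practical support for triaging flaky reports and understanding their underlying causes in quantum software. The expanded dataset and the automated pipeline provide reusable artifacts for the quantum software engineering community. Future work will concentrate on improving detection robustness and exploring automated repair of quantum flaky tests.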
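Note: The abstract names two building blocks, cosine similarity for finding further flaky-test reports and F1-score for evaluating classifiers, without giving implementation details. The following is a minimal Python sketch of both under stated assumptions, not the authors' pipeline: the bag-of-words term-count vectors stand in for whatever text representation the paper actually uses, and the function names `cosine_similarity` and `f1_score` as well as the sample strings are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    Assumption: whitespace tokenization and raw term counts; the paper
    does not state which text representation it uses.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for the positive class (1 = flaky, 0 = not flaky)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical issue texts, invented for illustration only.
    known_flaky = "test fails intermittently depending on random seed"
    candidates = [
        "test fails intermittently on CI without code changes",
        "add documentation for the transpiler pass",
    ]
    # Rank candidate issues by similarity to a known flaky report.
    for text in sorted(candidates,
                       key=lambda c: cosine_similarity(known_flaky, c),
                       reverse=True):
        print(f"{cosine_similarity(known_flaky, text):.3f}  {text}")
```

In a sketch like this, issues scoring above some chosen threshold would be passed on for manual or LLM-based review; the threshold and the review step are left open because the abstract does not specify them.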

Related papers