Fugu-MT Paper Translation (Abstract): Beyond Coverage and Kill Scores: Empirically Measuring Test Suite Behavioural Gaps

Paper overview: Beyond Coverage and Kill Scores: Empirically Measuring Test Suite Behavioural Gaps

arxiv url: http://arxiv.org/abs/2606.10417v1
Date: Tue, 09 Jun 2026 04:46:32 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:37.990765
Title: Beyond Coverage and Kill Scores: Empirically Measuring Test Suite Behavioural Gaps
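Title (reference translation): Empirically Measuring Test Suite Behavioural Gaps
Authors: Partha Protim Paul, Reid Holmes
Abstract summary: Traditional test adequacy metrics measure a system's implementation, not whether it adheres to its expected behaviour. We introduce an automated proof-of-concept approach to investigate the gaps between what the code is expected to do and what it actually does. We evaluate the approach across ten popular open-source Java libraries comprising 8,922 methods, extracting 20,729 behaviours with 93.1% precision.
Reference score (independently computed attention score): 4.434030666628529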
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Traditional test adequacy metrics measure a system's implementation, not whether it adheres to its expected behaviour. While developers rely heavily on code coverage and mutation testing to assess test suite quality, these metrics are fundamentally implementation-centric and cannot detect gaps between what the code is expected to do and what it actually does. Unfortunately, there has been no way to reliably detect these discrepancies; in this paper we introduce an automated proof-of-concept approach to investigate these gaps. The approach extracts expected method-level behaviours from natural language documentation and source code, maps them to existing test cases, and identifies gaps between expected and validated behaviours. We evaluate the approach across ten popular open-source Java libraries comprising 8,922 methods, extracting 20,729 behaviours with 93.1% precision. Our empirical analysis conservatively estimates that 17.5% of detected expected behaviours remain entirely untested, which we term the test suite's behavioural gap. To determine whether these gaps are merely an artifact of human-driven testing, we evaluate state-of-the-art automated test generators (EVOSUITE / ASTER), finding that they similarly fail to validate at least 20.6% / 27.1% of detected expected behaviours. We further demonstrate that behavioural gaps are not predicted by traditional structural metrics: the majority of untested behaviours occur in methods that already have high line coverage, and over half persist in methods with a high mutation kill score. These results suggest that behavioural coverage acts as an independent dimension of test suite adequacy that can complement traditional structural metrics.
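The abstract does not give an exact formula for the behavioural gap. As a minimal sketch, assuming the gap is simply the share of extracted expected behaviours that no test validates, it could be computed as below. The `Behaviour` record, the `behavioural_gap` function, and the sample data are hypothetical illustrations, not the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Behaviour:
    """One expected method-level behaviour (hypothetical record layout)."""
    method: str
    description: str
    # Names of the tests mapped to this behaviour; empty means untested.
    validating_tests: list[str] = field(default_factory=list)


def behavioural_gap(behaviours: list[Behaviour]) -> float:
    """Return the fraction of expected behaviours with no validating test.

    Assumption: a behaviour counts as untested when no test maps to it.
    """
    if not behaviours:
        raise ValueError("need at least one expected behaviour")
    untested = sum(1 for b in behaviours if not b.validating_tests)
    return untested / len(behaviours)


# Invented sample data: four expected behaviours, one of them untested.
behaviours = [
    Behaviour("Stack.pop", "returns the most recently pushed element",
              ["testPopReturnsTop"]),
    Behaviour("Stack.pop", "throws when the stack is empty"),
    Behaviour("Stack.push", "increases size by one", ["testPushGrows"]),
    Behaviour("Stack.peek", "does not remove the element", ["testPeek"]),
]

gap = behavioural_gap(behaviours)
print(f"behavioural gap: {gap:.1%}")  # prints "behavioural gap: 25.0%"
```

In this invented example, `testPopReturnsTop` could execute every line of a `Stack.pop` method while the empty-stack behaviour stays unvalidated, which is the kind of case where line coverage alone would not reveal the gap.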
