Exceptional Behaviors: How Frequently Are They Tested?
- URL: http://arxiv.org/abs/2602.05123v1
- Date: Wed, 04 Feb 2026 23:15:18 GMT
- Title: Exceptional Behaviors: How Frequently Are They Tested?
- Authors: Andre Hora, Gordon Fraser
- Abstract summary: We run an instrumented version of test suites, monitor their execution, and collect information about the exceptions raised at runtime. We analyze the test suites of 25 Python systems, covering 5,372 executed methods, 17.9M calls, and 1.4M raised exceptions. Close to 80% of the methods that raise exceptions do so infrequently, but about 20% raise exceptions more frequently.
- Score: 10.004295333072948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exceptions allow developers to handle error cases expected to occur infrequently. Ideally, good test suites should test both normal and exceptional behaviors to catch more bugs and avoid regressions. While current research analyzes exceptions that propagate to tests, it does not explore other exceptions that do not reach the tests. In this paper, we provide an empirical study to explore how frequently exceptional behaviors are tested in real-world systems. We consider both exceptions that propagate to tests and the ones that do not reach the tests. For this purpose, we run an instrumented version of test suites, monitor their execution, and collect information about the exceptions raised at runtime. We analyze the test suites of 25 Python systems, covering 5,372 executed methods, 17.9M calls, and 1.4M raised exceptions. We find that 21.4% of the executed methods do raise exceptions at runtime. In methods that raise exceptions, on the median, 1 in 10 calls exercise exceptional behaviors. Close to 80% of the methods that raise exceptions do so infrequently, but about 20% raise exceptions more frequently. Finally, we provide implications for researchers and practitioners. We suggest developing novel tools to support exercising exceptional behaviors and refactoring expensive try/except blocks. We also call attention to the fact that exception-raising behaviors are not necessarily "abnormal" or rare.
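The instrumentation approach described in the abstract can be sketched as a small Python decorator. This is a hypothetical simplification, not the paper's actual tooling: it counts, per method, how many calls occur and how many of those calls raise an exception, re-raising each exception so the test suite's outcome is unchanged.

```python
import functools

# Per-function counters: total calls and calls that raised an exception.
call_stats = {}

def monitor_exceptions(func):
    """Hypothetical sketch of the study's instrumentation: count calls to
    `func` and how many of them raise, without changing its behavior."""
    call_stats[func.__name__] = {"calls": 0, "raised": 0}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = call_stats[func.__name__]
        stats["calls"] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["raised"] += 1
            raise  # re-raise so the test suite sees the same outcome
    return wrapper

@monitor_exceptions
def parse_positive(value):
    """Example method under test: raises ValueError on bad input."""
    n = int(value)  # may raise ValueError for non-numeric strings
    if n <= 0:
        raise ValueError("expected a positive integer")
    return n

# Simulate a test suite exercising both normal and exceptional paths.
for v in ["3", "7", "-1", "oops"]:
    try:
        parse_positive(v)
    except ValueError:
        pass

print(call_stats["parse_positive"])  # 4 calls, 2 of which raised
```

Aggregating such counters across a whole test run yields exactly the kind of per-method call/exception ratios the paper reports (e.g. "1 in 10 calls exercise exceptional behaviors").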
Related papers
- Understanding Bug-Reproducing Tests: A First Empirical Study [10.004295333072948]
We analyze 642 bug-reproducing tests of 15 real-world Python systems. We find that bug-reproducing tests are not (statistically significantly) different from other tests regarding LOC, number of assertions, and complexity. We detect that the majority (95%) of the bug-reproducing tests reproduce a single bug, while 5% reproduce multiple bugs.
arXiv Detail & Related papers (2026-02-03T01:04:18Z) - Test Behaviors, Not Methods! Detecting Tests Obsessed by Methods [3.6417668958891785]
Tests that verify multiple behaviors are harder to understand, lack focus, and are more coupled to the production code. We propose a novel test smell named *Test Obsessed by Method*: a test method that covers multiple paths of a single production method.
arXiv Detail & Related papers (2026-01-31T14:58:39Z) - Intention-Driven Generation of Project-Specific Test Cases [45.2380093475221]
We propose IntentionTest, which generates project-specific tests given a description of the validation intention. We extensively evaluate IntentionTest against state-of-the-art baselines (DA, ChatTester, and EvoSuite) on 4,146 test cases from 13 open-source projects.
arXiv Detail & Related papers (2025-07-28T08:35:04Z) - A Tool for Generating Exceptional Behavior Tests With Large Language Models [36.97613436193272]
We present exLong, a framework that automatically generates exceptional behavior tests (EBTs). ExLong incorporates reasoning about exception-throwing traces, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. Our demonstration video illustrates how exLong can effectively assist developers in creating comprehensive EBTs for their project.
arXiv Detail & Related papers (2025-05-28T19:53:20Z) - Studying the Impact of Early Test Termination Due to Assertion Failure on Code Coverage and Spectrum-based Fault Localization [48.22524837906857]
This study is the first empirical study on early test termination due to assertion failure. We investigated 207 versions of 6 open-source projects. Our findings indicate that early test termination harms both code coverage and the effectiveness of spectrum-based fault localization.
arXiv Detail & Related papers (2025-04-06T17:14:09Z) - Loop unrolling: formal definition and application to testing [33.432652829284244]
Testing processes usually aim at high coverage, but loops severely limit coverage ambitions since the number of iterations is generally not predictable. This article provides a formal definition and a set of formal properties of unrolling. Using this definition as the conceptual basis, we have applied an unrolling strategy to an existing automated testing framework.
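The unrolling idea summarized above can be illustrated with a toy Python example (function names are hypothetical, not from the paper): each unrolled copy of the loop body is guarded by the loop condition, so the first iterations become explicit branches that a coverage tool can target directly.

```python
def count_down(n):
    """Original loop: iteration count depends on the input."""
    steps = 0
    while n > 0:
        n -= 1
        steps += 1
    return steps

def count_down_unrolled_2(n):
    """Same function with the loop unrolled twice: the first two
    iterations become explicit, individually coverable branches."""
    steps = 0
    if n > 0:          # unrolled iteration 1
        n -= 1
        steps += 1
        if n > 0:      # unrolled iteration 2
            n -= 1
            steps += 1
            while n > 0:  # residual loop beyond the unrolling depth
                n -= 1
                steps += 1
    return steps

# The unrolled version is semantically equivalent to the original.
assert all(count_down(n) == count_down_unrolled_2(n) for n in range(6))
```

Covering both unrolled `if` branches forces test inputs that execute the loop zero, one, and two or more times, which is the testing benefit the article formalizes.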
arXiv Detail & Related papers (2025-02-21T15:36:21Z) - Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework [58.36391985790157]
In real-world software development, improper or missing exception handling can severely impact the robustness and reliability of code. We explore the use of large language models (LLMs) to improve exception handling in code. We propose Seeker, a multi-agent framework inspired by expert developer strategies for exception handling.
arXiv Detail & Related papers (2024-12-16T12:35:29Z) - Towards Exception Safety Code Generation with Intermediate Representation Agents Framework [54.03528377384397]
Large Language Models (LLMs) often struggle with robust exception handling in generated code, leading to fragile programs that are prone to runtime errors. We propose Seeker, a novel multi-agent framework that enforces exception safety in LLM-generated code through an Intermediate Representation (IR) approach. Seeker decomposes exception handling into five specialized agents: Scanner, Detector, Predator, Ranker, and Handler.
arXiv Detail & Related papers (2024-10-09T14:45:45Z) - exLong: Generating Exceptional Behavior Tests with Large Language Models [41.145231237535356]
ExLong is a framework that automatically generates exceptional behavior tests. It embeds reasoning about traces that lead to throw statements, conditional expressions that guard throw statements, and non-exceptional behavior tests. We compare exLong with the state-of-the-art models for test generation (CAT-LM) and one of the strongest foundation models (GPT-4o).
arXiv Detail & Related papers (2024-05-23T14:28:41Z) - Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.