Test Behaviors, Not Methods! Detecting Tests Obsessed by Methods
- URL: http://arxiv.org/abs/2602.00761v1
- Date: Sat, 31 Jan 2026 14:58:39 GMT
- Title: Test Behaviors, Not Methods! Detecting Tests Obsessed by Methods
- Authors: Andre Hora, Andy Zaidman
- Abstract summary: Tests that verify multiple behaviors are harder to understand, lack focus, and are more coupled to the production code. We propose a novel test smell named \emph{Test Obsessed by Method}, a test method that covers multiple paths of a single production method.
- Score: 3.6417668958891785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Best testing practices state that tests should verify a single functionality or behavior of the system. Tests that verify multiple behaviors are harder to understand, lack focus, and are more coupled to the production code. An attempt to identify this issue is the test smell \emph{Eager Test}, which aims to capture tests that verify too much functionality based on the number of production method calls. Unfortunately, prior research suggests that counting production method calls is an inaccurate measure, as these calls do not reliably serve as a proxy for functionality. We envision a complementary solution based on runtime analysis: we hypothesize that some tests that verify multiple behaviors will likely cover multiple paths of the same production methods. Thus, we propose a novel test smell named \emph{Test Obsessed by Method}, a test method that covers multiple paths of a single production method. We provide an initial empirical study to explore the presence of this smell in 2,054 tests provided by 12 test suites of the Python Standard Library. (1) We detect 44 \emph{Tests Obsessed by Methods} in 11 of the 12 test suites. (2) Each smelly test verifies a median of two behaviors of the production method. (3) The 44 smelly tests could be split into 118 novel tests. (4) 23% of the smelly tests have code comments recognizing that distinct behaviors are being tested. We conclude by discussing benefits, limitations, and further research.
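To make the smell concrete, here is a hypothetical unittest sketch (the production method and all names are invented for illustration, not taken from the paper's Python Standard Library dataset): a single smelly test covers all three paths of one production method, followed by the split into one focused test per behavior.

```python
import unittest

def classify(n):
    """Hypothetical production method with three paths (branches)."""
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"

class TestObsessedByMethod(unittest.TestCase):
    # Smelly: one test method exercises all three paths of classify(),
    # i.e., it verifies multiple behaviors of a single production method.
    def test_classify(self):
        self.assertEqual(classify(-1), "negative")
        self.assertEqual(classify(0), "zero")
        self.assertEqual(classify(7), "positive")

class TestClassifySplit(unittest.TestCase):
    # Refactored: each test verifies a single behavior (one path).
    def test_negative(self):
        self.assertEqual(classify(-1), "negative")

    def test_zero(self):
        self.assertEqual(classify(0), "zero")

    def test_positive(self):
        self.assertEqual(classify(7), "positive")
```

Run with `python -m unittest` to execute both variants; the split version mirrors the paper's observation that smelly tests can be decomposed into several focused tests.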
Related papers
- Reduction of Test Re-runs by Prioritizing Potential Order Dependent Flaky Tests [0.5798758080057375]
Flaky tests can make automated software testing unreliable due to their unpredictable behavior.
A common type of flaky test is the order-dependent (OD) test.
We propose a method to prioritize potential OD tests.
arXiv Detail & Related papers (2025-10-30T06:17:30Z)
- JS-TOD: Detecting Order-Dependent Flaky Tests in Jest [5.178246622041266]
JS-TOD is a tool that can extract, reorder, and rerun Jest tests to reveal possible order-dependent test flakiness.
Test order dependency is one of the leading causes of test flakiness.
arXiv Detail & Related papers (2025-08-30T11:44:14Z)
- Intention-Driven Generation of Project-Specific Test Cases [45.2380093475221]
We propose IntentionTest, which generates project-specific tests given a description of the validation intention.
We extensively evaluate IntentionTest against state-of-the-art baselines (DA, ChatTester, and EvoSuite) on 4,146 test cases from 13 open-source projects.
arXiv Detail & Related papers (2025-07-28T08:35:04Z)
- Studying the Impact of Early Test Termination Due to Assertion Failure on Code Coverage and Spectrum-based Fault Localization [48.22524837906857]
This study is the first empirical study on early test termination due to assertion failure.
We investigated 207 versions of 6 open-source projects.
Our findings indicate that early test termination harms both code coverage and the effectiveness of spectrum-based fault localization.
arXiv Detail & Related papers (2025-04-06T17:14:09Z)
- Detecting and Evaluating Order-Dependent Flaky Tests in JavaScript [3.6513675781808357]
Flaky tests pose a significant issue for software testing.
Previous research has identified test order dependency as one of the most prevalent causes of flakiness.
This paper aims to investigate test order dependency in JavaScript tests.
arXiv Detail & Related papers (2025-01-22T06:52:11Z)
- LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom Large Language Models (LLMs) to generate realistic test inputs.
We evaluate it against several state-of-the-art REST API testing tools, including RESTGPT, a GPT-powered specification-enhancement tool.
Our study shows that small language models can perform as well as, or better than, large language models in REST API testing.
arXiv Detail & Related papers (2025-01-15T05:51:20Z)
- Model Equality Testing: Which Model Is This API Serving? [59.005869726179455]
API providers may quantize, watermark, or finetune the underlying model, changing the output distribution.
We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem.
A test built on a simple string kernel achieves a median of 77.4% power against a range of distortions.
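The summary above describes a two-sample test on model outputs built from a string kernel. As a toy illustration only (the n-gram kernel, the MMD statistic, and the permutation test below are assumptions of this sketch, not the paper's actual kernel or test), such a test might look like:

```python
import random
from collections import Counter

def ngram_kernel(s, t, n=3):
    # Toy string kernel: inner product of character n-gram count vectors.
    a = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    b = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(a[g] * b[g] for g in a)

def mmd2(xs, ys, k):
    # Biased estimate of squared Maximum Mean Discrepancy between two
    # samples of strings under kernel k.
    kxx = sum(k(x1, x2) for x1 in xs for x2 in xs) / len(xs) ** 2
    kyy = sum(k(y1, y2) for y1 in ys for y2 in ys) / len(ys) ** 2
    kxy = sum(k(x, y) for x in xs for y in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

def permutation_test(xs, ys, k, trials=200, seed=0):
    # Two-sample test: p-value of the observed MMD under random
    # reassignment of samples to the two groups.
    rng = random.Random(seed)
    observed = mmd2(xs, ys, k)
    pooled = xs + ys
    count = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):], k) >= observed:
            count += 1
    return (count + 1) / (trials + 1)
```

A small p-value suggests the two sets of outputs (e.g., reference model vs. API responses) come from different distributions; the quadratic loops are fine for the small sample sizes a sketch like this targets.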
arXiv Detail & Related papers (2024-10-26T18:34:53Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
- Evaluating the Robustness of Test Selection Methods for Deep Neural Networks [32.01355605506855]
Testing deep learning-based systems is crucial but challenging due to the required time and labor for labeling collected raw data.
To alleviate the labeling effort, multiple test selection methods have been proposed where only a subset of test data needs to be labeled.
This paper explores when and to what extent test selection methods fail for testing.
arXiv Detail & Related papers (2023-07-29T19:17:49Z)
- Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision [85.07855130048951]
We study a more practical task setting, called test-agnostic long-tailed recognition, where the training class distribution is long-tailed.
We propose a new method, called Test-time Aggregating Diverse Experts (TADE), that trains diverse experts to excel at handling different test distributions.
We theoretically show that our method has provable ability to simulate unknown test class distributions.
arXiv Detail & Related papers (2021-07-20T04:10:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.