Just-in-Time Catching Test Generation at Meta
- URL: http://arxiv.org/abs/2601.22832v1
- Date: Fri, 30 Jan 2026 10:58:32 GMT
- Title: Just-in-Time Catching Test Generation at Meta
- Authors: Matthew Becker, Yifei Chen, Nicholas Cochran, Pouyan Ghasemi, Abhishek Gulati, Mark Harman, Zachary Haluza, Mehrdad Honarkhah, Herve Robert, Jiacheng Liu, Weini Liu, Sreeja Thummala, Xiaoning Yang, Rui Xin, Sophie Zeng
- Abstract summary: Just-in-Time catching tests are meant to fail, surfacing bugs before code lands. We show code-change-aware methods improve candidate catch generation 4x over hardening tests and 20x over coincidentally failing tests. We reported 41 candidate catches to engineers; 8 were confirmed to be true positives, 4 of which would have led to serious failures had they remained uncaught.
- Score: 10.710139850909073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We report on Just-in-Time catching test generation at Meta, designed to prevent bugs in large-scale backend systems comprising hundreds of millions of lines of code. Unlike traditional hardening tests, which pass at generation time, catching tests are meant to fail, surfacing bugs before code lands. The primary challenge is to reduce development drag from false-positive test failures. Analyzing 22,126 generated tests, we show that code-change-aware methods improve candidate catch generation 4x over hardening tests and 20x over coincidentally failing tests. To address false positives, we use rule-based and LLM-based assessors. These assessors reduce human review load by 70%. Inferential statistical analysis showed that human-accepted code changes are assessed to have significantly more false positives, while human-rejected changes have significantly more true positives. We reported 41 candidate catches to engineers; 8 were confirmed to be true positives, 4 of which would have led to serious failures had they remained uncaught. Overall, our results show that Just-in-Time catching is scalable, industrially applicable, and prevents serious failures from reaching production.
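The hardening/catching distinction in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, with hypothetical function names and values, not Meta's actual pipeline: a hardening test is generated to pass against current behavior, while a catching test is generated against a proposed code change and its failure is the signal.

```python
# Illustrative sketch only (hypothetical names, not Meta's pipeline):
# a hardening test passes at generation time and guards against future
# regressions; a catching test targets a proposed code change and is
# meant to fail, surfacing the bug before the change lands.

def discount_price(price: float, percent: float) -> float:
    """Current, correct implementation."""
    return price * (1 - percent / 100)

def discount_price_proposed(price: float, percent: float) -> float:
    """Proposed change with an injected bug: the /100 was dropped."""
    return price * (1 - percent)

def hardening_test() -> bool:
    # Generated against the existing code: passes now, protects later.
    return discount_price(200.0, 10.0) == 180.0

def catching_test() -> bool:
    # Generated against the *changed* code: its failure flags the bug.
    return discount_price_proposed(200.0, 10.0) == 180.0

print("hardening test passes:", hardening_test())
print("catching test fails:", not catching_test())
```

The asymmetry is the point: a passing catching test would be uninformative, whereas its failure, if confirmed as a true positive, blocks a bug before it lands.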
Related papers
- YATE: The Role of Test Repair in LLM-Based Unit Test Generation [22.67442101368384]
We propose a technique for repairing some of these incorrect tests through a combination of rule-based static analysis and re-prompting. We evaluate this simple approach, named YATE, on a set of 6 open-source projects. YATE achieves 22% higher line coverage, 20% higher branch coverage, and kills 20% more mutants at a comparable cost.
arXiv Detail & Related papers (2025-07-24T11:32:31Z)
- Reflective Unit Test Generation for Precise Type Error Detection with Large Language Models [13.969152395348653]
RTED is a type-aware test generation technique for automatically detecting Python type errors. We show that RTED can detect 22-29 more benchmarked type errors than four state-of-the-art techniques. It is also capable of producing fewer false positives, achieving an improvement of 173.9%-245.9% in precision.
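The summary above describes RTED only at a high level. As a generic illustration, not RTED's actual algorithm, a type-error-detecting unit test for Python can look like the following, where the error only manifests on a runtime path that static checkers may miss:

```python
# Generic illustration (not RTED itself): a unit test that surfaces a
# Python type error which only manifests at runtime on a mixed-type input.

def total_length(items) -> int:
    # Latent bug: assumes every element supports len(); an int does not.
    return sum(len(x) for x in items)

def detects_type_error() -> bool:
    """True iff the generated input triggers a TypeError."""
    try:
        total_length(["ab", 3, "cd"])
    except TypeError:
        return True
    return False

print(detects_type_error())  # the generated input caught the latent type error
```

The precision figures above then correspond to how often such a triggering input reflects a genuine type error rather than a false alarm.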
arXiv Detail & Related papers (2025-07-03T05:10:33Z)
- Harden and Catch for Just-in-Time Assured LLM-Based Software Testing: Open Research Challenges [12.931831095319456]
We show that hardening and catching tests raise exciting new challenges in the context of Large Language Models for software test generation. A hardening test seeks to protect against future regressions, while a catching test is one that catches such a regression or a fault in new functionality introduced by a code change. We show that any solution to Catching JiTTest generation can also be repurposed to catch latent faults in legacy code.
arXiv Detail & Related papers (2025-04-23T07:32:43Z)
- Studying the Impact of Early Test Termination Due to Assertion Failure on Code Coverage and Spectrum-based Fault Localization [48.22524837906857]
This is the first empirical study of early test termination due to assertion failure. We investigated 207 versions of 6 open-source projects. Our findings indicate that early test termination harms both code coverage and the effectiveness of spectrum-based fault localization.
arXiv Detail & Related papers (2025-04-06T17:14:09Z)
- Examining False Positives under Inference Scaling for Mathematical Reasoning [83.97128486951999]
We systematically examine the prevalence of false positive solutions in mathematical problem solving for language models. Our experimental results reveal that: (1) false positive solutions persist across different models, datasets, and decoding methods; (2) sampling-based inference-time scaling methods do not alleviate the problem; and (3) the pass@N evaluation metric is more susceptible to false positives.
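Point (3) has a simple probabilistic reading (the rates below are assumed purely for illustration): if each sampled solution has some chance of being a false positive, the chance that at least one false positive lands in a batch of N samples, and can therefore be credited by pass@N, grows quickly with N.

```python
# Numeric sketch with assumed rates: why pass@N is more exposed to false
# positives. With per-sample false-positive probability p_fp, the chance
# that a batch of n independent samples contains at least one grows toward 1.

def prob_at_least_one_fp(p_fp: float, n: int) -> float:
    """P(>=1 false positive among n independent samples)."""
    return 1 - (1 - p_fp) ** n

for n in (1, 4, 16, 64):
    print(f"N={n:2d}: P(>=1 false positive) = {prob_at_least_one_fp(0.05, n):.3f}")
```

Even a modest 5% per-sample false-positive rate makes a false positive near-certain somewhere in a large sample batch, which is exactly what inflates pass@N.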
arXiv Detail & Related papers (2025-02-10T07:49:35Z)
- Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability. Motivated by this gap in the evaluation of reliability, we propose the concept of platinum benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z)
- Measuring the Influence of Incorrect Code on Test Generation [22.168699378889148]
We show that tests generated for incorrect code exhibit a 47% worse bug-detection rate. Improvements of +18% in accuracy, +4% in coverage, and +34% in bug detection can be achieved by providing natural language code descriptions.
arXiv Detail & Related papers (2024-09-14T15:17:34Z)
- Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% on our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
arXiv Detail & Related papers (2024-06-11T09:21:50Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects captured during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
- Taming Timeout Flakiness: An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests negatively affect regression testing because they result in test failures that are not necessarily caused by code changes.
Test timeouts are one contributing factor to such flaky test failures.
The test flakiness rate ranges from 49% to 70%, depending on the number of repeated test executions.
arXiv Detail & Related papers (2024-02-07T20:01:41Z)
- 230,439 Test Failures Later: An Empirical Evaluation of Flaky Failure Classifiers [9.45325012281881]
Flaky tests are tests that can non-deterministically pass or fail, even in the absence of code changes.
How can one quickly determine whether a test failed due to flakiness or because it detected a bug?
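A common baseline for that question, and a reference point for the classifiers the paper evaluates, is rerun-based classification, sketched here with simulated tests; the labels and the rerun budget are hypothetical illustrative choices. The idea: rerun the failing test without any code change and label it flaky if any rerun passes.

```python
# Hedged sketch of rerun-based flaky-failure classification (simulated
# tests; the rerun budget of 10 is an arbitrary illustrative choice).
import random

def classify_by_rerun(test, rng: random.Random, reruns: int = 10) -> str:
    if test(rng):
        return "pass"
    for _ in range(reruns):
        if test(rng):
            return "flaky"          # failed, then passed with no code change
    return "likely real failure"    # failed consistently on every rerun

def nondeterministic_test(rng: random.Random) -> bool:
    # Simulated flaky test: passes about 60% of the time.
    return rng.random() < 0.6

print(classify_by_rerun(nondeterministic_test, random.Random(0)))
```

The paper's scale (230,439 failures) highlights the weakness of this baseline: reruns are expensive, and a low per-rerun pass probability can still misclassify a flaky failure as a real one.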
arXiv Detail & Related papers (2024-01-28T22:36:30Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- Back to the Future! Studying Data Cleanness in Defects4J and its Impact on Fault Localization [3.8040257966829802]
We examine Defects4J's fault-triggering tests, emphasizing the implications of developer knowledge of SBFL techniques.
We found that 55% of the fault-triggering tests were newly added to replicate the bug or to test for regression.
We also found that 22% of the fault-triggering tests were modified after the bug reports were created, containing developer knowledge of the bug.
arXiv Detail & Related papers (2023-10-29T20:19:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.