Understanding and Detecting Flaky Builds in GitHub Actions
- URL: http://arxiv.org/abs/2602.02307v1
- Date: Mon, 02 Feb 2026 16:39:56 GMT
- Title: Understanding and Detecting Flaky Builds in GitHub Actions
- Authors: Wenhao Ge, Chen Zhang,
- Abstract summary: We present a large-scale empirical study of flaky builds in GitHub Actions based on rerun data from 1,960 Java projects. We identify 15 distinct categories of flaky failures, among which flaky tests, network issues, and dependency resolution issues are the most prevalent. We propose a machine learning-based approach for detecting flaky failures at the job level.
- Score: 6.3850400710838615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continuous Integration (CI) is widely used to provide rapid feedback on code changes; however, CI build outcomes are not always reliable. Builds may fail intermittently due to non-deterministic factors, leading to flaky builds that undermine developers' trust in CI, waste computational resources, and threaten the validity of CI-related empirical studies. In this paper, we present a large-scale empirical study of flaky builds in GitHub Actions based on rerun data from 1,960 open-source Java projects. Our results show that 3.2% of builds are rerun, and 67.73% of these rerun builds exhibit flaky behavior, affecting 1,055 (51.28%) of the projects. Through an in-depth failure analysis, we identify 15 distinct categories of flaky failures, among which flaky tests, network issues, and dependency resolution issues are the most prevalent. Building on these findings, we propose a machine learning-based approach for detecting flaky failures at the job level. Compared with a state-of-the-art baseline, our approach improves the F1-score by up to 20.3%.
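A job-level detector of the kind the abstract describes can be pictured as a classifier over failure-log features. The sketch below is a hand-weighted illustration only: the signal phrases, weights, bias, and logistic scoring are assumptions for exposition, not the paper's actual model or feature set.

```python
import math

# Illustrative keyword signals keyed to the paper's top flaky categories
# (flaky tests, network issues, dependency resolution); weights are made up.
FLAKY_SIGNALS = {
    "connection timed out": 2.0,            # network issue
    "could not resolve dependencies": 1.5,  # dependency resolution
    "read timed out": 1.2,                  # network issue
}
DETERMINISTIC_SIGNALS = {
    "compilation error": -2.5,  # almost never flaky
}

def flakiness_score(log: str) -> float:
    """Score a failed job log with a logistic link over keyword hits."""
    text = log.lower()
    z = -1.0  # bias: a failure is assumed non-flaky absent evidence
    for phrase, weight in {**FLAKY_SIGNALS, **DETERMINISTIC_SIGNALS}.items():
        if phrase in text:
            z += weight
    return 1.0 / (1.0 + math.exp(-z))

def is_flaky(log: str, threshold: float = 0.5) -> bool:
    """Flag a failed job as likely flaky when its score clears the threshold."""
    return flakiness_score(log) >= threshold
```

In practice a learned model (e.g. gradient boosting over TF-IDF log features) would replace the hand-set weights, but the pipeline shape is the same: featurize the failure log, score, threshold.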
Related papers
- Cross-Project Flakiness: A Case Study of the OpenStack Ecosystem [12.704721607953433]
Flakiness erodes developer trust in test results, wastes computational resources, and undermines Continuous Integration reliability. We focus on cross-project flakiness, where flaky tests impact multiple projects, and inconsistent flakiness, where a test exhibits flakiness in some projects but remains stable in others. These findings underline the need for better coordination across complex ecosystems, standardized CI configurations, and improved test isolation strategies.
arXiv Detail & Related papers (2026-02-10T01:03:28Z)
- Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs. kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z)
- Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study [5.127121704630949]
We analyze 8,106 fix-related PRs authored by five widely used AI coding agents from the AIDEV POP dataset. Our results indicate that test case failures and prior resolution of the same issues by other PRs are the most common causes of non-integration.
arXiv Detail & Related papers (2026-01-29T22:06:58Z)
- BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
High-quality bugs are key to training the next generation of language model-based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs.
arXiv Detail & Related papers (2025-10-22T17:58:56Z)
- Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent errors in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z)
- Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
- Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving [62.71545696485824]
We introduce AGENT KB, a universal memory infrastructure enabling seamless experience sharing across heterogeneous agent frameworks without retraining. AGENT KB aggregates trajectories into a structured knowledge base and serves lightweight APIs. We validate AGENT KB across major frameworks on GAIA, Humanity's Last Exam, GPQA, and SWE-bench.
arXiv Detail & Related papers (2025-07-08T17:59:22Z)
- Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios [31.749442120603774]
Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads. We have identified the most prevalent errors -- such as ModuleNotFoundError and TypeError -- and highlighted particularly challenging errors like OSError and database-related issues.
arXiv Detail & Related papers (2025-03-16T06:24:51Z)
- Detecting Continuous Integration Skip: A Reinforcement Learning-based Approach [0.4297070083645049]
Continuous Integration (CI) practices facilitate the seamless integration of code changes by employing automated building and testing processes.
Some frameworks, such as Travis CI and GitHub Actions have significantly contributed to simplifying and enhancing the CI process.
Developers continue to encounter difficulties in accurately flagging commits either as suitable for CI execution or as candidates for skipping.
arXiv Detail & Related papers (2024-05-15T18:48:57Z)
- Detecting Build Dependency Errors in Incremental Builds [13.823208277774572]
We propose EChecker to detect build dependency errors in the context of incremental builds.
EChecker automatically updates actual build dependencies by inferring them from C/C++ pre-processor directives and Makefile changes from new commits.
EChecker increases the build dependency error detection efficiency by an average of 85.14 times.
arXiv Detail & Related papers (2024-04-20T07:01:11Z)
- DARTS-: Robustly Stepping out of Performance Collapse Without Indicators [74.21019737169675]
Differentiable architecture search suffers from long-standing performance instability.
Indicators such as Hessian eigenvalues have been proposed as signals to stop searching before performance collapses.
In this paper, we undertake a more subtle and direct approach to resolve the collapse.
arXiv Detail & Related papers (2020-09-02T12:54:13Z)
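Among the related papers, EChecker infers build dependencies from C/C++ pre-processor directives. That core idea can be sketched as follows; this is a simplified, hypothetical illustration, not EChecker's implementation, which additionally tracks Makefile changes across commits.

```python
import re

# Quoted includes (#include "x.h") name project-local headers and thus
# imply a file-level build dependency; angle-bracket includes (<stdio.h>)
# are system headers and are deliberately ignored here.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)

def local_deps(source: str) -> list:
    """Return the local-header dependencies of one translation unit."""
    return sorted(set(INCLUDE_RE.findall(source)))
```

A dependency checker can compare such inferred edges against what the build script actually declares: a quoted include with no corresponding Makefile prerequisite is a candidate build dependency error.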
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.