Predicting Intermittent Job Failure Categories for Diagnosis Using Few-Shot Fine-Tuned Language Models
- URL: http://arxiv.org/abs/2601.22264v1
- Date: Thu, 29 Jan 2026 19:34:34 GMT
- Authors: Henri Aïdasso, Francis Bordeleau, Ali Tizghadam
- Abstract summary: FlaXifyer is a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. LogSift is an interpretability technique that identifies influential log statements in under one second. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of intermittent job failures.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In principle, Continuous Integration (CI) pipeline failures provide valuable feedback to developers on code-related errors. In practice, however, pipeline jobs often fail intermittently due to non-deterministic tests, network outages, infrastructure failures, resource exhaustion, and other reliability issues. These intermittent (flaky) job failures lead to substantial inefficiencies: wasted computational resources from repeated reruns and significant diagnosis time that distracts developers from core activities and often requires intervention from specialized teams. Prior work has proposed machine learning techniques to detect intermittent failures, but does not address the subsequent diagnosis challenge. To fill this gap, we introduce FlaXifyer, a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. FlaXifyer requires only job execution logs and achieves 84.3% Macro F1 and 92.0% Top-2 accuracy with just 12 labeled examples per category. We also propose LogSift, an interpretability technique that identifies influential log statements in under one second, reducing review effort by 74.4% while surfacing relevant failure information in 87% of cases. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of intermittent job failures.
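The core idea of the abstract, predicting a failure category from a job's log with only a handful of labeled examples per category, can be illustrated with a minimal sketch. This is not the paper's FlaXifyer implementation: it uses simple bag-of-words centroids in place of a fine-tuned pre-trained language model, and the category names and log excerpts below are invented for illustration.

```python
from collections import Counter
import math

# Hypothetical few-shot training set: a handful of labeled log excerpts
# per failure category (the paper's FlaXifyer needs ~12 per category).
SHOTS = {
    "network": [
        "curl could not resolve host while fetching dependencies",
        "connection timed out while contacting the package registry",
    ],
    "resources": [
        "fatal error runtime out of memory",
        "no space left on device while writing artifact",
    ],
}

def tokens(text):
    """Lowercase whitespace tokenization, keeping alphanumeric tokens only."""
    return [t for t in text.lower().split() if t.isalnum()]

def centroid(examples):
    """Sum the token counts of all examples for one category."""
    counts = Counter()
    for ex in examples:
        counts.update(tokens(ex))
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

CENTROIDS = {cat: centroid(exs) for cat, exs in SHOTS.items()}

def predict(log_excerpt):
    """Assign the category whose few-shot centroid is closest to the log."""
    query = Counter(tokens(log_excerpt))
    return max(CENTROIDS, key=lambda cat: cosine(query, CENTROIDS[cat]))
```

For example, `predict("error: connection timed out contacting registry")` returns `"network"`, since the query shares more tokens with the network-failure shots than with the resource-exhaustion ones. A pre-trained language model replaces these surface token counts with embeddings that generalize across differently worded logs, which is what makes the few-shot setting viable in practice.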
Related papers
- DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems [48.971606069204825]
DoVer is an intervention-driven debug framework for large language model (LLM)-based multi-agent systems. It augments hypothesis generation with active verification through targeted interventions. DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
arXiv Detail & Related papers (2025-12-07T09:23:48Z)
- Efficient Detection of Intermittent Job Failures Using Few-Shot Learning [2.8402080392117757]
We introduce a novel approach to intermittent job failure detection using few-shot learning. Our approach achieves 70-88% F1-score with only 12 shots in all projects, outperforming the state-of-the-art (SOTA) approach.
arXiv Detail & Related papers (2025-07-05T22:04:01Z)
- On the Diagnosis of Flaky Job Failures: Understanding and Prioritizing Failure Categories [2.8402080392117757]
Flaky job failures are one of the main issues hindering Continuous Deployment (CD). This study examines 4,511 flaky job failures at TELUS to identify the different categories of flaky failures, which we prioritize based on Recency, Frequency, and Monetary (RFM) measures.
arXiv Detail & Related papers (2025-01-09T05:15:55Z)
- Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis [29.800380941293277]
Engineers prioritize two categories of log information for diagnosis: fault-indicating descriptions and fault-indicating parameters.
We propose an approach, named LoFI, to automatically extract fault-indicating information from logs for fault diagnosis.
LoFI outperforms all baseline methods by a significant margin, achieving an absolute improvement of 25.8-37.9 in F1 over the best baseline method, ChatGPT.
arXiv Detail & Related papers (2024-09-20T15:00:47Z)
- PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning [58.85063149619348]
We propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows.
Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets.
arXiv Detail & Related papers (2023-01-25T16:34:43Z)
- Fast and Accurate Error Simulation for CNNs against Soft Errors [64.54260986994163]
We present a framework for the reliability analysis of Convolutional Neural Networks (CNNs) via an error simulation engine.
These error models are defined based on the corruption patterns of the output of the CNN operators induced by faults.
We show that our methodology achieves about 99% accuracy of the fault effects w.r.t. SASSIFI, and a speedup ranging from 44x up to 63x w.r.t. SASSIFI, which only implements a limited set of error models.
arXiv Detail & Related papers (2022-06-04T19:45:02Z)
- Failure Identification from Unstable Log Data using Deep Learning [0.27998963147546146]
We present CLog as a method for failure identification.
By representing the log data as sequences of subprocesses instead of sequences of log events, the effect of the unstable log data is reduced.
Our experimental results demonstrate that the learned subprocesses representations reduce the instability in the input.
arXiv Detail & Related papers (2022-04-06T07:41:48Z)
- LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak Supervision [63.08516384181491]
We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts.
Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect.
Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.
arXiv Detail & Related papers (2021-11-02T15:16:08Z)
- Tracking the risk of a deployed model and detecting harmful distribution shifts [105.27463615756733]
In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially.
We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate.
arXiv Detail & Related papers (2021-10-12T17:21:41Z)
- Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users.
We propose a framework for anomaly detection in log data, a major source of system troubleshooting information.
arXiv Detail & Related papers (2021-02-23T09:17:05Z)
- Feature Engineering for Scalable Application-Level Post-Silicon Debugging [0.456877715768796]
We present solutions for both observability enhancement and root-cause diagnosis in post-silicon System-on-Chip (SoC) validation.
We model specification of interacting flows in typical applications for message selection.
We define the diagnosis problem as identifying buggy traces as outliers and bug-free traces as inliers/normal behaviors.
arXiv Detail & Related papers (2021-02-08T22:11:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.