Improving Merge Pipeline Throughput in Continuous Integration via Pull Request Prioritization
- URL: http://arxiv.org/abs/2508.08342v1
- Date: Mon, 11 Aug 2025 08:25:07 GMT
- Title: Improving Merge Pipeline Throughput in Continuous Integration via Pull Request Prioritization
- Authors: Maximilian Jungwirth, Martin Gruber, Gordon Fraser
- Abstract summary: We propose to optimize the order of PRs in merge pipelines using practical build predictions. By dynamically prioritizing likely passing PRs during peak hours, this approach maximizes throughput when it matters most.
- Score: 11.003075182677156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating changes into large monolithic software repositories is a critical step in modern software development that substantially impacts the speed of feature delivery, the stability of the codebase, and the overall productivity of development teams. To ensure the stability of the main branch, many organizations use merge pipelines that test software versions before the changes are permanently integrated. However, the load on merge pipelines is often so high that they become bottlenecks, despite the use of parallelization. Existing optimizations frequently rely on specific build systems, limiting their generalizability and applicability. In this paper we propose to optimize the order of PRs in merge pipelines using practical build predictions utilizing only historical build data, PR metadata, and contextual information to estimate the likelihood of successful builds in the merge pipeline. By dynamically prioritizing likely passing PRs during peak hours, this approach maximizes throughput when it matters most. Experiments conducted on a real-world, large-scale project demonstrate that predictive ordering significantly outperforms traditional first-in-first-out (FIFO), as well as non-learning-based ordering strategies. Unlike alternative optimizations, this approach is agnostic to the underlying build system and thus easily integrable into existing automated merge pipelines.
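The ordering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `PullRequest` type, the `pass_probability` field (standing in for the output of some build-outcome predictor), and the 09:00 to 17:00 peak window are all hypothetical assumptions chosen for illustration. During peak hours the queue is ordered by predicted pass probability; otherwise it falls back to FIFO.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class PullRequest:
    pr_id: int
    enqueued_at: datetime
    # Hypothetical: probability that the merge build passes, as estimated
    # by some predictor trained on historical build data and PR metadata.
    pass_probability: float


# Assumption for this sketch: peak load falls between 09:00 and 16:59.
PEAK_HOURS = range(9, 17)


def is_peak(now: datetime) -> bool:
    """Return True if `now` falls inside the assumed peak window."""
    return now.hour in PEAK_HOURS


def order_queue(queue: List[PullRequest], now: datetime) -> List[PullRequest]:
    """Order the merge queue.

    Peak hours: most likely passing PRs first, ties broken by arrival time.
    Off-peak: plain first-in-first-out by arrival time.
    """
    if is_peak(now):
        return sorted(queue, key=lambda pr: (-pr.pass_probability, pr.enqueued_at))
    return sorted(queue, key=lambda pr: pr.enqueued_at)
```

Breaking ties by arrival time keeps the ordering deterministic and preserves FIFO fairness among PRs with equal predicted probability; a real pipeline would likely also need a starvation guard for PRs with low scores, which this sketch omits.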
Related papers
- From Ad-Hoc Scripts to Orchestrated Pipelines: Architecting a Resilient ELT Framework for Developer Productivity Metrics [0.0]
This paper reports on our experience migrating from legacy scheduling to a robust Extract-Load-Transform pipeline.
Treating the metrics pipeline as a production-grade distributed system is a prerequisite for sustainable engineering analytics.
arXiv Detail & Related papers (2026-02-25T04:46:08Z)
- Autonomous Data Processing using Meta-Agents [2.3732259124656907]
We present Autonomous Data Processing using Meta-agents (ADP-MA), a framework that dynamically constructs, executes, and iteratively refines data processing pipelines.
ADP-MA emphasizes context-aware optimization, adaptive workload partitioning, and progressive sampling for scalability.
We demonstrate ADP-MA through an interactive demo that showcases pipeline construction, execution monitoring, and adaptive refinement across representative data processing tasks.
arXiv Detail & Related papers (2026-01-30T20:58:17Z)
- Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline [9.013236765328301]
We propose ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines.
ADOPT explicitly models the dependency between each LLM step and the final task outcome, enabling precise text-gradient estimation.
Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust.
arXiv Detail & Related papers (2025-12-31T15:46:37Z)
- Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework.
We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution.
We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
- DyFlow: Dynamic Workflow Framework for Agentic Reasoning [79.19799197382478]
DyFlow is a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures based on task requirements and real-time intermediate feedback.
We systematically evaluate DyFlow across diverse domains, including social reasoning, biomedical tasks, mathematical problem solving, and code generation.
Results demonstrate that DyFlow significantly outperforms existing baselines, achieving substantial Pass@k improvements and exhibiting robust generalization across diverse domains.
arXiv Detail & Related papers (2025-09-30T10:36:23Z)
- syftr: Pareto-Optimal Generative AI [40.80352098169579]
syftr is a framework that performs efficient multi-objective search over a broad space of agentic and non-agentic RAG configurations.
syftr finds flows which are on average approximately 9 times cheaper while preserving most of the accuracy of the most accurate flows.
arXiv Detail & Related papers (2025-05-26T17:43:13Z)
- Practical Pipeline-Aware Regression Test Optimization for Continuous Integration [9.079940595000087]
Continuous Integration (CI) is commonly applied to ensure consistent code quality.
Developers commonly split test executions across multiple pipelines, running small and fast tests in pre-submit stages while executing long-running and flaky tests in post-submit pipelines.
We developed a lightweight and pipeline-aware regression test optimization approach that employs Reinforcement Learning models trained on language-agnostic features.
arXiv Detail & Related papers (2025-01-20T15:39:16Z)
- Flow with FlorDB: Incremental Context Maintenance for the Machine Learning Lifecycle [9.424552130799661]
We present techniques to harvest and query arbitrary metadata from machine learning pipelines.
We show how hindsight logging allows such statements to be added and executed post-hoc.
This is done in a "metadata later style" off the critical path of agile development.
arXiv Detail & Related papers (2024-08-05T14:21:00Z)
- A Refreshed Similarity-based Upsampler for Direct High-Ratio Feature Upsampling [54.05517338122698]
A popular similarity-based feature upsampling pipeline has been proposed, which utilizes a high-resolution feature as guidance.
We propose an explicitly controllable query-key feature alignment from both semantic-aware and detail-aware perspectives.
We develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts.
arXiv Detail & Related papers (2024-07-02T14:12:21Z)
- Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors [44.5740422079]
We show that pretraining with standard denoising objectives leads to dramatic gains across multiple architectures.
In stark contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena when properly pretrained.
arXiv Detail & Related papers (2023-10-04T17:17:06Z)
- Deep incremental learning models for financial temporal tabular datasets with distribution shifts [0.9790236766474201]
The framework uses a simple basic building block (decision trees) to build self-similar models of any required complexity.
We demonstrate our scheme using XGBoost models trained on the Numerai dataset and show that a two layer deep ensemble of XGBoost models over different model snapshots delivers high quality predictions.
arXiv Detail & Related papers (2023-03-14T14:10:37Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- DHA: End-to-End Joint Optimization of Data Augmentation Policy, Hyper-parameter and Architecture [81.82173855071312]
We propose an end-to-end solution that integrates the AutoML components and returns a ready-to-use model at the end of the search.
DHA achieves state-of-the-art (SOTA) results on various datasets, notably 77.4% accuracy on ImageNet with a cell-based search space.
arXiv Detail & Related papers (2021-09-13T08:12:50Z)
- Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "data-echoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
arXiv Detail & Related papers (2020-10-26T14:55:31Z)
- MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines [29.999324319722508]
We identify two main challenges that arise during the deployment of machine learning pipelines, and address them with the design of versioning for an end-to-end analytics system, MLCask.
We define and accelerate the metric-driven merge operation by pruning the pipeline search tree using reusable history records and pipeline compatibility information.
The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases.
arXiv Detail & Related papers (2020-10-17T13:34:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.