Towards an Optimized Benchmarking Platform for CI/CD Pipelines
- URL: http://arxiv.org/abs/2510.18640v1
- Date: Tue, 21 Oct 2025 13:43:20 GMT
- Title: Towards an Optimized Benchmarking Platform for CI/CD Pipelines
- Authors: Nils Japke, Sebastian Koch, Helmut Lukasczyk, David Bermbach,
- Abstract summary: Benchmarking is essential for identifying performance regressions and maintaining service-level agreements.<n>Performance benchmarks are resource-intensive and time-consuming.<n>There is currently no practical system that integrates these optimizations seamlessly into real-world Continuous Integration / Continuous Deployment pipelines.
- Score: 1.3999481573773072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performance regressions in large-scale software systems can lead to substantial resource inefficiencies, making their early detection critical. Frequent benchmarking is essential for identifying these regressions and maintaining service-level agreements (SLAs). Performance benchmarks, however, are resource-intensive and time-consuming, which is a major challenge for integration into Continuous Integration / Continuous Deployment (CI/CD) pipelines. Although numerous benchmark optimization techniques have been proposed to accelerate benchmark execution, there is currently no practical system that integrates these optimizations seamlessly into real-world CI/CD pipelines. In this vision paper, we argue that the field of benchmark optimization remains under-explored in key areas that hinder its broader adoption. We identify three central challenges to enabling frequent and efficient benchmarking: (a) the composability of benchmark optimization strategies, (b) automated evaluation of benchmarking results, and (c) the usability and complexity of applying these strategies as part of CI/CD systems in practice. We also introduce a conceptual cloud-based benchmarking framework handling these challenges transparently. By presenting these open problems, we aim to stimulate research toward making performance regression detection in CI/CD systems more practical and effective.
Related papers
- CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios [17.11442807888366]
Causal is a benchmark suite designed to assess the robustness of time-series causal discovery methods under violations of modeling assumptions.<n>We conduct extensive benchmarking of representative TSCD algorithms across eight assumption-violation scenarios.<n>The methods exhibiting superior overall performance across diverse scenarios are almost deep learning-based approaches.
arXiv Detail & Related papers (2026-02-08T11:27:06Z) - Benchmarking that Matters: Rethinking Benchmarking for Practical Impact [2.952553461344481]
We propose a vision centered on curated real-world-inspired benchmarks, practitioner-accessible feature spaces and community-maintained performance databases.<n>Real progress requires coordinated effort: A living benchmarking ecosystem that evolves with real-world insights and supports both scientific understanding and industrial use.
arXiv Detail & Related papers (2025-11-15T15:42:15Z) - A Benchmark Suite for Multi-Objective Optimization in Battery Thermal Management System Design [0.0]
This study develops and presents a specialized benchmark suite for multi-objective optimization in Battery Thermal Management System (BTMS) design.<n>The primary goal of this benchmark suite is to provide a practical and relevant testing ground for evolutionary algorithms and optimization methods.
arXiv Detail & Related papers (2025-10-29T06:48:22Z) - WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking [60.35109192765302]
Information seeking is a core capability that enables autonomous reasoning and decision-making.<n>We propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories.<n>Our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.
arXiv Detail & Related papers (2025-10-28T17:51:42Z) - Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation [55.47971671635531]
Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA)<n>Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge.<n>Existing systems primarily rely on unstructured documents, while largely overlooking relational databases.
arXiv Detail & Related papers (2025-09-30T22:19:44Z) - XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning [26.063477716451512]
We introduce XQC: a well-motivated, sample-efficient deep actor-critic algorithm built upon soft actor-critic.<n>We achieve state-of-the-art sample efficiency across 55 proprioception and 15 vision-based continuous control tasks.
arXiv Detail & Related papers (2025-09-29T17:58:53Z) - NDCG-Consistent Softmax Approximation with Accelerated Convergence [67.10365329542365]
We propose novel loss formulations that align directly with ranking metrics.<n>We integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method.<n> Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance.
arXiv Detail & Related papers (2025-06-11T06:59:17Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI (systems that autonomously plan and act) are becoming widespread, yet their task success rate on complex tasks remains low.<n>Inference-time alignment relies on three components: sampling, evaluation, and feedback.<n>We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development.<n>We present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents.<n>We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z) - A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which ecoded by embedding vectors, resulting in excessively large embedding tables.<n>Despite the prosperity of lightweight embedding-based RSs, a wide diversity is seen in evaluation protocols.<n>This study investigates various LERS' performance, efficiency, and cross-task transferability via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z) - Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime [59.27851754647913]
Predictive optimization is the precise modeling of many real-world applications, including energy cost-aware scheduling and budget allocation on advertising.
We develop a modular framework to benchmark 11 existing PtO/PnO methods on 8 problems, including a new industrial dataset for advertising.
Our study shows that PnO approaches are better than PtO on 7 out of 8 benchmarks, but there is no silver bullet found for the specific design choices of PnO.
arXiv Detail & Related papers (2023-11-13T13:19:34Z) - DeLag: Using Multi-Objective Optimization to Enhance the Detection of
Latency Degradation Patterns in Service-based Systems [0.76146285961466]
We present DeLag, a novel automated search-based approach for diagnosing performance issues in service-based systems.
DeLag simultaneously searches for multiple latency patterns while optimizing precision, recall and dissimilarity.
arXiv Detail & Related papers (2021-10-21T13:59:32Z) - Reinforcement Learning for Datacenter Congestion Control [50.225885814524304]
Successful congestion control algorithms can dramatically improve latency and overall network throughput.
Until today, no such learning-based algorithms have shown practical potential in this domain.
We devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks.
We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training.
arXiv Detail & Related papers (2021-02-18T13:49:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.