Related papers: Towards an Optimized Benchmarking Platform for CI/CD Pipelines

URL: http://arxiv.org/abs/2510.18640v1
Date: Tue, 21 Oct 2025 13:43:20 GMT
Title: Towards an Optimized Benchmarking Platform for CI/CD Pipelines
Authors: Nils Japke, Sebastian Koch, Helmut Lukasczyk, David Bermbach
Abstract summary: Benchmarking is essential for identifying performance regressions and maintaining service-level agreements. Performance benchmarks are resource-intensive and time-consuming. There is currently no practical system that integrates these optimizations seamlessly into real-world Continuous Integration / Continuous Deployment pipelines.
Score: 1.3999481573773072
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Performance regressions in large-scale software systems can lead to substantial resource inefficiencies, making their early detection critical. Frequent benchmarking is essential for identifying these regressions and maintaining service-level agreements (SLAs). Performance benchmarks, however, are resource-intensive and time-consuming, which is a major challenge for integration into Continuous Integration / Continuous Deployment (CI/CD) pipelines. Although numerous benchmark optimization techniques have been proposed to accelerate benchmark execution, there is currently no practical system that integrates these optimizations seamlessly into real-world CI/CD pipelines. In this vision paper, we argue that the field of benchmark optimization remains under-explored in key areas that hinder its broader adoption. We identify three central challenges to enabling frequent and efficient benchmarking: (a) the composability of benchmark optimization strategies, (b) automated evaluation of benchmarking results, and (c) the usability and complexity of applying these strategies as part of CI/CD systems in practice. We also introduce a conceptual cloud-based benchmarking framework handling these challenges transparently. By presenting these open problems, we aim to stimulate research toward making performance regression detection in CI/CD systems more practical and effective.
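Challenge (b) in the abstract, automated evaluation of benchmarking results, can be illustrated with a minimal sketch. The paper does not prescribe an evaluation method; the bootstrap comparison of median runtimes below, the function name `detect_regression`, and the 5% threshold are all assumptions chosen for illustration.

```python
import random
import statistics


def detect_regression(baseline, candidate, threshold=0.05, n_boot=2000, seed=0):
    """Hypothetical CI gate: flag a regression when the bootstrap 95% interval
    of the relative change in median runtime lies entirely above `threshold`."""
    rng = random.Random(seed)  # fixed seed so the CI verdict is reproducible
    changes = []
    for _ in range(n_boot):
        # Resample both runs with replacement and compare their medians.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        changes.append(statistics.median(c) / statistics.median(b) - 1.0)
    changes.sort()
    lower = changes[int(0.025 * n_boot)]
    upper = changes[int(0.975 * n_boot) - 1]
    return {"lower": lower, "upper": upper, "regression": lower > threshold}


if __name__ == "__main__":
    # Illustrative runtimes in milliseconds (made-up values, not measured data).
    baseline = [100.0, 101.0, 99.0, 102.0, 98.0, 100.0, 101.0, 99.0]
    candidate = [x * 1.2 for x in baseline]
    print(detect_regression(baseline, candidate))
```

A pipeline step could fail the build when `regression` is true. Requiring the whole interval to exceed the threshold trades some sensitivity for fewer false alarms on noisy cloud runners.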

Related papers

CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios [17.11442807888366]
CausalCompass is a benchmark suite designed to assess the robustness of time-series causal discovery methods under violations of modeling assumptions. We conduct extensive benchmarking of representative TSCD algorithms across eight assumption-violation scenarios. The methods exhibiting superior overall performance across diverse scenarios are almost all deep learning-based approaches.
arXiv Detail & Related papers (2026-02-08T11:27:06Z)
Benchmarking that Matters: Rethinking Benchmarking for Practical Impact [2.952553461344481]
We propose a vision centered on curated real-world-inspired benchmarks, practitioner-accessible feature spaces and community-maintained performance databases. Real progress requires coordinated effort: A living benchmarking ecosystem that evolves with real-world insights and supports both scientific understanding and industrial use.
arXiv Detail & Related papers (2025-11-15T15:42:15Z)
A Benchmark Suite for Multi-Objective Optimization in Battery Thermal Management System Design [0.0]
This study develops and presents a specialized benchmark suite for multi-objective optimization in Battery Thermal Management System (BTMS) design. The primary goal of this benchmark suite is to provide a practical and relevant testing ground for evolutionary algorithms and optimization methods.
arXiv Detail & Related papers (2025-10-29T06:48:22Z)
WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking [60.35109192765302]
Information seeking is a core capability that enables autonomous reasoning and decision-making. We propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories. Our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.
arXiv Detail & Related papers (2025-10-28T17:51:42Z)
Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation [55.47971671635531]
Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA). Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge. Existing systems primarily rely on unstructured documents, while largely overlooking relational databases.
arXiv Detail & Related papers (2025-09-30T22:19:44Z)
XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning [26.063477716451512]
We introduce XQC: a well-motivated, sample-efficient deep actor-critic algorithm built upon soft actor-critic. We achieve state-of-the-art sample efficiency across 55 proprioception and 15 vision-based continuous control tasks.
arXiv Detail & Related papers (2025-09-29T17:58:53Z)
NDCG-Consistent Softmax Approximation with Accelerated Convergence [67.10365329542365]
We propose novel loss formulations that align directly with ranking metrics. We integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method. Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance.
arXiv Detail & Related papers (2025-06-11T06:59:17Z)
On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems, which autonomously plan and act, are becoming widespread, yet their success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development. We present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents. We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which are encoded by embedding vectors, resulting in excessively large embedding tables. Despite the prosperity of lightweight embedding-based RSs (LERSs), evaluation protocols vary widely. This study investigates the performance, efficiency, and cross-task transferability of various LERSs via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z)
Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime [59.27851754647913]
Predictive optimization precisely models many real-world applications, including energy cost-aware scheduling and budget allocation in advertising. We develop a modular framework to benchmark 11 existing PtO/PnO methods on 8 problems, including a new industrial dataset for advertising. Our study shows that PnO approaches are better than PtO on 7 out of 8 benchmarks, but no silver bullet emerges for the specific design choices of PnO.
arXiv Detail & Related papers (2023-11-13T13:19:34Z)
DeLag: Using Multi-Objective Optimization to Enhance the Detection of Latency Degradation Patterns in Service-based Systems [0.76146285961466]
We present DeLag, a novel automated search-based approach for diagnosing performance issues in service-based systems. DeLag simultaneously searches for multiple latency patterns while optimizing precision, recall and dissimilarity.
arXiv Detail & Related papers (2021-10-21T13:59:32Z)
Reinforcement Learning for Datacenter Congestion Control [50.225885814524304]
Successful congestion control algorithms can dramatically improve latency and overall network throughput. Until today, no such learning-based algorithms have shown practical potential in this domain. We devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training.
arXiv Detail & Related papers (2021-02-18T13:49:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.