An Empirical Study on Method-Level Performance Evolution in Open-Source Java Projects
- URL: http://arxiv.org/abs/2508.07084v1
- Date: Sat, 09 Aug 2025 19:39:01 GMT
- Title: An Empirical Study on Method-Level Performance Evolution in Open-Source Java Projects
- Authors: Kaveh Shahedi, Nana Gyambrah, Heng Li, Maxime Lamothe, Foutse Khomh,
- Abstract summary: We conducted a large-scale empirical study analyzing performance evolution in 15 mature open-source Java projects.<n>Our findings reveal that 32.7% of method-level changes result in measurable performance impacts.<n>Algorithmic changes demonstrate the highest improvement potential but carry substantial regression risk.
- Score: 14.908341749591594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Performance is a critical quality attribute in software development, yet the impact of method-level code changes on performance evolution remains poorly understood. While developers often make intuitive assumptions about which types of modifications are likely to cause performance regressions or improvements, these beliefs lack empirical validation at a fine-grained level. We conducted a large-scale empirical study analyzing performance evolution in 15 mature open-source Java projects hosted on GitHub. Our analysis encompassed 739 commits containing 1,499 method-level code changes, using Java Microbenchmark Harness (JMH) for precise performance measurement and rigorous statistical analysis to quantify both the significance and magnitude of performance variations. We employed bytecode instrumentation to capture method-specific execution metrics and systematically analyzed four key aspects: temporal performance patterns, code change type correlations, developer and complexity factors, and domain-size interactions. Our findings reveal that 32.7% of method-level changes result in measurable performance impacts, with regressions occurring 1.3 times more frequently than improvements. Contrary to conventional wisdom, we found no significant differences in performance impact distributions across code change categories, challenging risk-stratified development strategies. Algorithmic changes demonstrate the highest improvement potential but carry substantial regression risk. Senior developers produce more stable changes with fewer extreme variations, while code complexity correlates with increased regression likelihood. Domain-size interactions reveal significant patterns, with web server + small projects exhibiting the highest performance instability. Our study provides empirical evidence for integrating automated performance testing into continuous integration pipelines.
Related papers
- Agentic Refactoring: An Empirical Study of AI Coding Agents [9.698067623031909]
Agentic coding tools, such as OpenAI Codex, Claude Code, and Cursor, are transforming the software engineering landscape.<n>These AI-powered systems function as autonomous teammates capable of planning and executing complex development tasks.<n>There is a critical lack of empirical understanding regarding how agentic is utilized in practice, how it compares to human-driven, and what impact it has on code quality.
arXiv Detail & Related papers (2025-11-06T21:24:38Z) - ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
ARISE (Adaptive Resolution-aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models.<n>We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains.
arXiv Detail & Related papers (2025-10-07T15:10:51Z) - Refactoring $\neq$ Bug-Inducing: Improving Defect Prediction with Code Change Tactics Analysis [54.361900378970134]
Just-in-time defect prediction (JIT-DP) aims to predict the likelihood of code changes resulting in software defects at an early stage.<n>Prior research has largely ignored code during both the evaluation and methodology phases, despite its prevalence.<n>We propose Code chAnge Tactics (CAT) analysis to categorize code and its propagation, which improves labeling accuracy in the JIT-Defects4J dataset by 13.7%.
arXiv Detail & Related papers (2025-07-25T23:29:25Z) - Scaling Test-time Compute for LLM Agents [51.790752085445384]
Scaling test time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs)<n>In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents.
arXiv Detail & Related papers (2025-06-15T17:59:47Z) - Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models [7.486731499255164]
This paper conducts the first large-scale empirical analysis of 308 fixed bugs across three popular distributed training/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI.<n>We examine bug symptoms, root causes, bug identification and fixing efforts, and common low-effort fixing strategies.
arXiv Detail & Related papers (2025-06-12T07:24:59Z) - The Impact of Feature Scaling In Machine Learning: Effects on Regression and Classification Tasks [0.8388858079753069]
This research addresses the critical lack of comprehensive studies on feature scaling by systematically evaluating 12 scaling techniques across 14 different Machine Learning algorithms and 16 datasets for classification and regression tasks.<n>We meticulously analyzed impacts on predictive performance (using metrics such as accuracy, MAE, MSE, and $R2$) and computational costs (training time, inference time, and memory usage).
arXiv Detail & Related papers (2025-06-09T22:32:51Z) - Identifying and Replicating Code Patterns Driving Performance Regressions in Software Systems [7.030339427131108]
Performance mutation testing introduces intentional defects to measure and enhance fault-detection capabilities.<n>A key challenge is understanding if generated mutants accurately reflect real-world performance issues.<n>This study evaluates and extends mutation operators for performance testing.
arXiv Detail & Related papers (2025-04-08T09:28:46Z) - On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations [53.0667196725616]
Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence where an agent uses a neural network to learn which actions to take in a given environment.<n>DRL has recently gained traction from being able to solve complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games.<n>Numerous implementations of the state-of-the-art algorithms responsible for training these agents, like the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist.
arXiv Detail & Related papers (2025-03-28T16:25:06Z) - Towards Understanding the Impact of Code Modifications on Software Quality Metrics [1.2277343096128712]
This study aims to assess and interpret the impact of code modifications on software quality metrics.
The underlying hypothesis posits that code modifications inducing similar changes in software quality metrics can be grouped into distinct clusters.
The results reveal distinct clusters of code modifications, each accompanied by a concise description, revealing their collective impact on software quality metrics.
arXiv Detail & Related papers (2024-04-05T08:41:18Z) - Large Language Models are Miscalibrated In-Context Learners [22.30783674111999]
In this work, we deliver an in-depth analysis of the behavior across different choices of learning methods.<n>We observe that the miscalibration problem exists across all learning methods in low-resource setups.<n>We find that self-ensembling with max probability produces robust and calibrated predictions.
arXiv Detail & Related papers (2023-12-21T11:55:10Z) - Evolving Pareto-Optimal Actor-Critic Algorithms for Generalizability and
Stability [67.8426046908398]
Generalizability and stability are two key objectives for operating reinforcement learning (RL) agents in the real world.
This paper presents MetaPG, an evolutionary method for automated design of actor-critic loss functions.
arXiv Detail & Related papers (2022-04-08T20:46:16Z) - Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks.
We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.