Performance Smells in ML and Non-ML Python Projects: A Comparative Study
- URL: http://arxiv.org/abs/2504.20224v1
- Date: Mon, 28 Apr 2025 19:48:26 GMT
- Title: Performance Smells in ML and Non-ML Python Projects: A Comparative Study
- Authors: François Belias, Leuson Da Silva, Foutse Khomh, Cyrine Zid
- Abstract summary: This study provides a comparative analysis of performance smells between Machine Learning (ML) and non-ML projects. Our results indicate that ML projects are more susceptible to performance smells due to the computational and data-intensive nature of ML. Our study underscores the need to tailor performance optimization strategies to the unique characteristics of ML projects.
- Score: 10.064805853389277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Python is widely adopted across various domains, especially in Machine Learning (ML) and traditional software projects. Despite its versatility, Python is susceptible to performance smells, i.e., suboptimal coding practices that can reduce application efficiency. This study provides a comparative analysis of performance smells between ML and non-ML projects, aiming to assess the occurrence of these inefficiencies while exploring their distribution across stages of the ML pipeline. To this end, we conducted an empirical study of 300 Python-based GitHub projects, distributed across ML and non-ML projects, categorizing performance smells with the RIdiom tool. Our results indicate that ML projects are more susceptible to performance smells, likely due to the computational and data-intensive nature of ML workflows. We also observed that performance smells in the ML pipeline predominantly affect the Data Processing stage. However, their presence in the Model Deployment stage indicates that such smells are not limited to the early stages of the pipeline. Our findings offer actionable insights for developers, emphasizing the importance of targeted optimizations for smells prevalent in ML projects. Furthermore, our study underscores the need to tailor performance optimization strategies to the unique characteristics of ML projects, with particular attention to the pipeline stages most affected by performance smells.
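To make "performance smell" concrete, here is a minimal Python sketch of one widespread smell of the kind idiom-based detectors such as RIdiom flag: growing a list through repeated concatenation instead of using a list comprehension. This is our own illustrative example, not code from the study.

```python
# A common Python performance smell: building a list by repeated
# concatenation copies the whole list on every iteration (quadratic
# time), while the idiomatic list comprehension runs in a single pass.

def squares_smelly(values):
    result = []
    for v in values:
        result = result + [v * v]  # full copy of `result` each time
    return result

def squares_idiomatic(values):
    return [v * v for v in values]

if __name__ == "__main__":
    data = list(range(5))
    assert squares_smelly(data) == squares_idiomatic(data) == [0, 1, 4, 9, 16]
```

On large inputs the smelly variant degrades sharply, which is one reason the data-heavy loops typical of ML code are especially sensitive to such patterns.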
Related papers
- MLScent: A tool for Anti-pattern detection in ML projects [5.669063174637433]
This paper introduces MLScent, a novel static analysis tool for code smell detection. MLScent implements 76 distinct detectors across major machine learning frameworks. Results show high accuracy in detecting framework-specific anti-patterns, data handling issues, and general ML code smells (a toy detector sketch follows this entry).
arXiv Detail & Related papers (2025-01-30T11:19:16Z)
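As a rough illustration of what an AST-based smell detector of this kind can look like, here is a minimal sketch built on Python's standard ast module. It is our own toy example and does not reflect MLScent's actual detectors or API; it flags one classic smell, iterating over range(len(...)) instead of using enumerate().

```python
import ast

class RangeLenDetector(ast.NodeVisitor):
    """Toy detector: flags `for i in range(len(xs))` loop headers."""

    def __init__(self):
        self.findings = []

    def visit_For(self, node):
        it = node.iter
        if (isinstance(it, ast.Call) and isinstance(it.func, ast.Name)
                and it.func.id == "range" and it.args
                and isinstance(it.args[0], ast.Call)
                and isinstance(it.args[0].func, ast.Name)
                and it.args[0].func.id == "len"):
            self.findings.append(
                f"line {node.lineno}: prefer enumerate() over range(len(...))")
        self.generic_visit(node)

source = """
for i in range(len(items)):
    print(i, items[i])
"""
detector = RangeLenDetector()
detector.visit(ast.parse(source))
print(detector.findings)  # one finding, pointing at the loop header
```

Real tools in this space layer many such visitors over project ASTs, which is broadly how a detector count like MLScent's 76 stays manageable.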
- Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning [104.27224674122313]
Fine-tuning MLLMs has become a common practice for improving performance on specific downstream tasks.
To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions.
arXiv Detail & Related papers (2024-11-17T01:16:37Z)
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more computation-efficient metric for performance estimation. We present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training.
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
- Performance Law of Large Language Models [58.32539851241063]
Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.
arXiv Detail & Related papers (2024-08-19T11:09:12Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense feed-forward networks (FFNs) into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of its performance while delivering over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have demonstrated impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer active parameters, but they remain hard to deploy due to their immense total parameter sizes.
This paper aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques (a toy sketch follows this entry).
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
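To give a feel for what expert-level skipping can mean, here is a hedged NumPy sketch: after top-k routing, experts whose gate weight falls below a threshold are skipped at inference time. The gating scheme, threshold, and shapes are our own assumptions for illustration, not the paper's actual technique.

```python
import numpy as np

# Toy MoE layer with post-routing expert skipping (illustrative only).
rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, top_k, tau = 4, 8, 6, 2, 0.2

experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))
x = rng.standard_normal((n_tokens, d_model))

logits = x @ router
gates = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

out = np.zeros_like(x)
for t in range(n_tokens):
    top = np.argsort(gates[t])[-top_k:]            # top-k experts per token
    kept = [e for e in top if gates[t, e] >= tau]  # skip low-confidence experts
    for e in kept:  # gate weights left unnormalized for simplicity
        out[t] += gates[t, e] * (x[t] @ experts[e])
```

Skipped experts cost no FLOPs for that token, which is the intuition behind deployment-time expert sparsification.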
- Don't Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning [0.0]
This paper addresses a critical issue in Machine Learning (ML) where unintended information contaminates the training data, impacting model performance evaluation.
The discrepancy between evaluated and actual performance on new data is a significant concern.
It explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning, and compares standard inductive ML with transductive ML frameworks (a minimal leakage example follows this entry).
arXiv Detail & Related papers (2024-01-24T20:30:52Z)
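The classic leakage pattern is easy to reproduce: fit a preprocessing step on the full dataset before splitting, and statistics from the test rows leak into training. A minimal scikit-learn sketch, our own illustration rather than code from the paper:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# Leaky: the scaler sees the test rows before the split.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky = LogisticRegression().fit(X_tr, y_tr)

# Safe: the pipeline fits the scaler on the training split only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
safe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
print(leaky.score(X_te, y_te), safe.score(X_te, y_te))
```

With plain standardization the inflation is mild, but with target-dependent steps such as feature selection the leaky variant can report substantially optimistic scores.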
- GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation [6.525197444717069]
GEVO-ML is a tool for discovering optimization opportunities and tuning the performance of Machine Learning kernels.
We demonstrate GEVO-ML on two different ML workloads for both model training and prediction.
GEVO-ML finds significant improvements for these models, achieving 90.43% performance improvement when model accuracy is relaxed by 2%.
arXiv Detail & Related papers (2023-10-16T09:24:20Z)
- Reasonable Scale Machine Learning with Open-Source Metaflow [2.637746074346334]
We argue that re-purposing existing tools won't solve the current productivity issues.
We introduce Metaflow, an open-source framework for ML projects explicitly designed to boost the productivity of data practitioners.
arXiv Detail & Related papers (2023-03-21T11:28:09Z)
- Exploring Opportunistic Meta-knowledge to Reduce Search Spaces for Automated Machine Learning [8.325359814939517]
This paper investigates whether, based on previous experience, a pool of available classifiers/regressors can be preemptively culled before a pipeline composition/optimisation process begins (a toy illustration follows this entry).
arXiv Detail & Related papers (2021-05-01T15:25:30Z)
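As a toy illustration of culling from meta-knowledge, the sketch below ranks candidate model families by their average rank on previously seen datasets and keeps only the top few before any pipeline optimisation starts. The candidate names and accuracy numbers are invented for the example; the paper's actual culling strategy may differ.

```python
import numpy as np

pool = ["svm", "random_forest", "knn", "naive_bayes", "sgd"]
# Rows: past datasets; columns: validation accuracy of each candidate.
history = np.array([
    [0.82, 0.88, 0.74, 0.70, 0.79],
    [0.78, 0.91, 0.71, 0.66, 0.80],
    [0.85, 0.86, 0.77, 0.69, 0.75],
])

# Double argsort converts accuracies to per-dataset ranks (0 = best).
mean_rank = (-history).argsort(axis=1).argsort(axis=1).mean(axis=0)
keep = [m for m, _ in sorted(zip(pool, mean_rank), key=lambda p: p[1])][:3]
print("culled pool:", keep)  # ['random_forest', 'svm', 'sgd']
```

Only the culled pool is then handed to the much more expensive composition/optimisation search.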