Continuous Integration Practices in Machine Learning Projects: The Practitioners' Perspective
- URL: http://arxiv.org/abs/2502.17378v1
- Date: Mon, 24 Feb 2025 18:01:50 GMT
- Title: Continuous Integration Practices in Machine Learning Projects: The Practitioners' Perspective
- Authors: João Helis Bernardo, Daniel Alencar da Costa, Filipe Roseiro Cogo, Sérgio Queiróz de Medeiros, Uirá Kulesza
- Abstract summary: This study surveys 155 practitioners from 47 Machine Learning (ML) projects. Practitioners highlighted eight key differences, including test complexity, infrastructure requirements, and build duration and stability. Common challenges mentioned by practitioners include higher project complexity, model training demands, extensive data handling, increased computational resource needs, and dependency management.
- Score: 1.4165457606269516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continuous Integration (CI) is a cornerstone of modern software development. However, while widely adopted in traditional software projects, applying CI practices to Machine Learning (ML) projects presents distinctive characteristics. For example, our previous work revealed that ML projects often experience longer build durations and lower test coverage rates compared to their non-ML counterparts. Building on these quantitative findings, this study surveys 155 practitioners from 47 ML projects to investigate the underlying reasons for these distinctive characteristics through a qualitative perspective. Practitioners highlighted eight key differences, including test complexity, infrastructure requirements, and build duration and stability. Common challenges mentioned by practitioners include higher project complexity, model training demands, extensive data handling, increased computational resource needs, and dependency management, all contributing to extended build durations. Furthermore, ML systems' non-deterministic nature, data dependencies, and computational constraints were identified as significant barriers to effective testing. The key takeaway from this study is that while foundational CI principles remain valuable, ML projects require tailored approaches to address their unique challenges. To bridge this gap, we propose a set of ML-specific CI practices, including tracking model performance metrics and prioritizing test execution within CI pipelines. Additionally, our findings highlight the importance of fostering interdisciplinary collaboration to strengthen the testing culture in ML projects. By bridging quantitative findings with practitioners' insights, this study provides a deeper understanding of the interplay between CI practices and the unique demands of ML projects, laying the groundwork for more efficient and robust CI strategies in this domain.
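To make the proposed practice of tracking model performance metrics within CI pipelines concrete, the sketch below shows one way such a check could look. It is a minimal, hypothetical example rather than an implementation from the paper: the file names metrics.json and baseline_metrics.json, the metric names, and the tolerance value are assumptions chosen for illustration. The idea is to fail the build when a metric regresses beyond a small tolerance, which leaves room for the non-deterministic training behavior the practitioners describe.

```python
# Hypothetical CI gate for tracking model performance metrics (illustrative sketch).
# Assumptions: metrics are stored as flat JSON, e.g. {"accuracy": 0.91, "f1": 0.88};
# file names and the tolerance are placeholders, not values from the paper.
import json
import sys
from pathlib import Path

TOLERANCE = 0.01  # allowed drop per metric, to absorb non-deterministic training runs


def load_metrics(path: str) -> dict:
    """Read a flat {metric_name: value} JSON file."""
    return json.loads(Path(path).read_text())


def check_regression(current: dict, baseline: dict, tolerance: float) -> list:
    """Return human-readable failures for every metric that regressed or is missing."""
    failures = []
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None:
            failures.append(f"missing metric: {name}")
        elif value < base_value - tolerance:
            failures.append(f"{name} regressed: {value:.4f} < baseline {base_value:.4f}")
    return failures


if __name__ == "__main__":
    current = load_metrics("metrics.json")            # produced by the evaluation step
    baseline = load_metrics("baseline_metrics.json")  # committed reference values
    problems = check_regression(current, baseline, TOLERANCE)
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # non-zero exit fails the CI job, like a failing unit test
    print("All tracked metrics are within tolerance of the baseline.")
```

In a CI workflow (for example, GitHub Actions), such a script would typically run as a step after model evaluation, so a performance regression breaks the build in the same way a failing test would.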
Related papers
- Exploring Individual Factors in the Adoption of LLMs for Specific Software Engineering Tasks [17.818350887316004]
This study explores the relationship between individual attributes related to technology adoption and Large Language Models (LLMs).
The findings reveal that task-specific adoption is influenced by distinct factors, some of which negatively impact adoption when considered in isolation.
arXiv Detail & Related papers (2025-04-03T13:07:04Z)
- Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models [52.569132872560814]
Multimodal large language models (MLLMs) have achieved significant breakthroughs, enhancing understanding across text and vision.
However, current MLLMs still face challenges in effectively integrating knowledge across these modalities during multimodal knowledge reasoning.
We analyze and compare the extent of consistency degradation in multimodal knowledge reasoning within MLLMs.
arXiv Detail & Related papers (2025-03-03T09:01:51Z)
- An Overview of Large Language Models for Statisticians [109.38601458831545]
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI).
This paper explores potential areas where statisticians can make important contributions to the development of LLMs.
We focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking and model adaptation.
arXiv Detail & Related papers (2025-02-25T03:40:36Z)
- A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models [0.0]
We propose a comprehensive approach to benchmark development based on rigorous psychometric principles.
We make the first attempt to illustrate this approach by creating a new benchmark in the field of pedagogy and education.
We construct a novel benchmark guided by Bloom's taxonomy and rigorously designed by a consortium of education experts trained in test development.
arXiv Detail & Related papers (2024-10-29T19:32:43Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- How do Machine Learning Projects use Continuous Integration Practices? An Empirical Study on GitHub Actions [1.5197353881052764]
We conduct a comprehensive analysis of 185 open-source projects on GitHub (93 ML and 92 non-ML projects).
Our investigation comprises both quantitative and qualitative dimensions, aiming to uncover differences in CI adoption between ML and non-ML projects.
Our findings indicate that ML projects often require longer build durations, and medium-sized ML projects exhibit lower test coverage compared to non-ML projects.
arXiv Detail & Related papers (2024-03-14T16:35:39Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- A Case Study on Test Case Construction with Large Language Models: Unveiling Practical Insights and Challenges [2.7029792239733914]
This paper examines the application of Large Language Models in the construction of test cases within the context of software engineering.
Through a blend of qualitative and quantitative analyses, this study assesses the impact of LLMs on test case comprehensiveness, accuracy, and efficiency.
arXiv Detail & Related papers (2023-12-19T20:59:02Z)
- When does In-context Learning Fall Short and Why? A Study on Specification-Heavy Tasks [54.71034943526973]
In-context learning (ICL) has become the default method for using large language models (LLMs).
We find that ICL falls short of handling specification-heavy tasks, which are tasks with complicated and extensive task specifications.
We identify three primary reasons: inability to specifically understand context, misalignment in task schema comprehension with humans, and inadequate long-text understanding ability.
arXiv Detail & Related papers (2023-11-15T14:26:30Z)
- Identifying Concerns When Specifying Machine Learning-Enabled Systems: A Perspective-Based Approach [1.2184324428571227]
PerSpecML is a perspective-based approach for specifying ML-enabled systems.
It helps practitioners identify which attributes, spanning ML and non-ML components, contribute to the overall system's quality.
arXiv Detail & Related papers (2023-09-14T18:31:16Z)
- CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [74.22729793816451]
Large Language Models (LLMs) have made significant progress in utilizing tools, but their ability is limited by API availability.
We propose CREATOR, a novel framework that enables LLMs to create their own tools using documentation and code realization.
We evaluate CREATOR on the MATH and TabMWP benchmarks, consisting of challenging math competition problems and tabular math word problems, respectively.
arXiv Detail & Related papers (2023-05-23T17:51:52Z)
- Understanding the Usability Challenges of Machine Learning in High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains.
In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions.
We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z)