Related papers: Empirical Study on the Software Engineering Practices in Open Source ML Package Repositories

Empirical Study on the Software Engineering Practices in Open Source ML Package Repositories

URL: http://arxiv.org/abs/2012.01403v2
Date: Tue, 8 Dec 2020 16:02:00 GMT
Title: Empirical Study on the Software Engineering Practices in Open Source ML Package Repositories
Authors: Minke Xiu, Ellis E. Eghan, Zhen Ming (Jack) Jiang, Bram Adams
Abstract summary: Modern Machine Learning technologies require considerable technical expertise and resources to develop, train and deploy such models. Such discovery and reuse by practitioners and researchers are being addressed by public ML package repositories. This paper conducts an exploratory study that analyzes the structure and contents of two popular ML package repositories.
Score: 6.2894222252929985
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent advances in Artificial Intelligence (AI), especially in Machine Learning (ML), have introduced various practical applications (e.g., virtual personal assistants and autonomous cars) that enhance the experience of everyday users. However, modern ML technologies like Deep Learning require considerable technical expertise and resources to develop, train and deploy such models, making effective reuse of the ML models a necessity. Such discovery and reuse by practitioners and researchers are being addressed by public ML package repositories, which bundle up pre-trained models into packages for publication. Since such repositories are a recent phenomenon, there is no empirical data on their current state and challenges. Hence, this paper conducts an exploratory study that analyzes the structure and contents of two popular ML package repositories, TFHub and PyTorch Hub, comparing their information elements (features and policies), package organization, package manager functionalities and usage contexts against popular software package repositories (npm, PyPI, and CRAN). Through these studies, we have identified unique SE practices and challenges for sharing ML packages. These findings and implications would be useful for data scientists, researchers and software developers who intend to use these shared ML packages.

Related papers

ExeKGLib: A Platform for Machine Learning Analytics based on Knowledge Graphs [6.611237989022405]
We present ExeKGLib, a Python library enhanced with a graphical interface layer that allows users with minimal ML knowledge to build ML pipelines.<n>This is achieved by relying on knowledge graphs that encode ML knowledge in simple terms to non-ML experts.
arXiv Detail & Related papers (2025-08-01T07:45:49Z)
How do Pre-Trained Models Support Software Engineering? An Empirical Study in Hugging Face [52.257764273141184]
Open-Source Pre-Trained Models (PTMs) provide extensive resources for various Machine Learning (ML) tasks.<n>These resources lack a classification tailored to Software Engineering (SE) needs.<n>We derive a taxonomy encompassing 147 SE tasks and apply an SE-oriented classification to PTMs in a popular open-source ML repository, Hugging Face (HF)<n>We find that code generation is the most common SE task among PTMs, while requirements engineering and software design activities receive limited attention.
arXiv Detail & Related papers (2025-06-03T15:51:17Z)
Towards a Classification of Open-Source ML Models and Datasets for Software Engineering [52.257764273141184]
Open-source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks. These resources lack a classification tailored to Software Engineering (SE) needs. We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time.
arXiv Detail & Related papers (2024-11-14T18:52:05Z)
On the Creation of Representative Samples of Software Repositories [1.8599311233727087]
With the emergence of social coding platforms such as GitHub, researchers have now access to millions of software repositories to use as source data for their studies. Current sampling methods are often based on random selection or rely on variables which may not be related to the research study. We present a methodology for creating representative samples of software repositories, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study.
arXiv Detail & Related papers (2024-10-01T12:41:15Z)
A Large-Scale Study of Model Integration in ML-Enabled Software Systems [4.776073133338119]
Machine learning (ML) and its embedding in systems has drastically changed the engineering of software-intensive systems. Traditionally, software engineering focuses on manually created artifacts such as source code and the process of creating them. We present the first large-scale study of real ML-enabled software systems, covering over 2,928 open source systems on GitHub.
arXiv Detail & Related papers (2024-08-12T15:28:40Z)
Wildest Dreams: Reproducible Research in Privacy-preserving Neural Network Training [2.853180143237022]
This work focuses on the ML model's training phase, where maintaining user data privacy is of utmost importance. We provide a solid theoretical background that eases the understanding of current approaches. We reproduce results for some of the papers and examine at what level existing works in the field provide support for open science.
arXiv Detail & Related papers (2024-03-06T10:25:36Z)
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows [72.40917624485822]
We introduce DataDreamer, an open source Python library that allows researchers to implement powerful large language models. DataDreamer also helps researchers adhere to best practices that we propose to encourage open science.
arXiv Detail & Related papers (2024-02-16T00:10:26Z)
Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation. We present an extensive overview by categorizing these works in terms of various IE subtasks and techniques. We empirically analyze the most advanced methods and discover the emerging trend of IE tasks with LLMs.
arXiv Detail & Related papers (2023-12-29T14:25:22Z)
Machine Learning-Enabled Software and System Architecture Frameworks [48.87872564630711]
The stakeholders with data science and Machine Learning related concerns, such as data scientists and data engineers, are yet to be included in existing architecture frameworks. We surveyed 61 subject matter experts from over 25 organizations in 10 countries.
arXiv Detail & Related papers (2023-08-09T21:54:34Z)
The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML Products [24.142477108938856]
This study contributes a dataset of 262 open-source ML products for end users, identified among more than half a million ML-related projects on GitHub. We find that the majority of the ML products in our sample represent more startup-style development than reported in past interview studies. We report 21 findings, including limited involvement of data scientists in many open-source ML products.
arXiv Detail & Related papers (2023-08-08T15:19:13Z)
CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence. Our library supports a collection of pretrained Code LLM models and popular code benchmarks. We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
OmniForce: On Human-Centered, Large Model Empowered and Cloud-Edge Collaborative AutoML System [85.8338446357469]
We introduce OmniForce, a human-centered AutoML system that yields both human-assisted ML and ML-assisted human techniques. We show how OmniForce can put an AutoML system into practice and build adaptive AI in open-environment scenarios.
arXiv Detail & Related papers (2023-03-01T13:35:22Z)
A Survey of Machine Unlearning [56.017968863854186]
Recent regulations now require that, on request, private information about a user must be removed from computer systems. ML models often remember' the old data. Recent works on machine unlearning have not been able to completely solve the problem.
arXiv Detail & Related papers (2022-09-06T08:51:53Z)
What can Data-Centric AI Learn from Data and ML Engineering? [17.247372757533185]
Data-centric AI is a new and exciting research topic in the AI community. Many organizations already build and maintain various "data-centric" applications. We discuss several lessons from data and ML engineering that could be interesting to apply in data-centric AI.
arXiv Detail & Related papers (2021-12-13T06:40:05Z)
Enabling Un-/Semi-Supervised Machine Learning for MDSE of the Real-World CPS/IoT Applications [0.5156484100374059]
We propose a novel approach to support domain-specific Model-Driven Software Engineering (MDSE) for the real-world use-case scenarios of smart Cyber-Physical Systems (CPS) and the Internet of Things (IoT) We argue that the majority of available data in the nature for Artificial Intelligence (AI) are unlabeled. Hence, unsupervised and/or semi-supervised ML approaches are the practical choices. Our proposed approach is fully implemented and integrated with an existing state-of-the-art MDSE tool to serve the CPS/IoT domain.
arXiv Detail & Related papers (2021-07-06T15:51:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.