Empirical Study on the Software Engineering Practices in Open Source ML
Package Repositories
- URL: http://arxiv.org/abs/2012.01403v2
- Date: Tue, 8 Dec 2020 16:02:00 GMT
- Title: Empirical Study on the Software Engineering Practices in Open Source ML
Package Repositories
- Authors: Minke Xiu, Ellis E. Eghan, Zhen Ming (Jack) Jiang, Bram Adams
- Abstract summary: Modern Machine Learning technologies like Deep Learning require considerable technical expertise and resources to develop, train and deploy models, making effective reuse of ML models a necessity.
Discovery and reuse of these models by practitioners and researchers are being addressed by public ML package repositories, which bundle pre-trained models into packages for publication.
This paper conducts an exploratory study that analyzes the structure and contents of two popular ML package repositories, TFHub and PyTorch Hub.
- Score: 6.2894222252929985
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in Artificial Intelligence (AI), especially in Machine
Learning (ML), have introduced various practical applications (e.g., virtual
personal assistants and autonomous cars) that enhance the experience of
everyday users. However, modern ML technologies like Deep Learning require
considerable technical expertise and resources to develop, train and deploy
such models, making effective reuse of the ML models a necessity. Such
discovery and reuse by practitioners and researchers are being addressed by
public ML package repositories, which bundle up pre-trained models into
packages for publication. Since such repositories are a recent phenomenon,
there is no empirical data on their current state and challenges. Hence, this
paper conducts an exploratory study that analyzes the structure and contents of
two popular ML package repositories, TFHub and PyTorch Hub, comparing their
information elements (features and policies), package organization, package
manager functionalities and usage contexts against popular software package
repositories (npm, PyPI, and CRAN). Through these studies, we have identified
unique SE practices and challenges for sharing ML packages. These findings and
implications would be useful for data scientists, researchers and software
developers who intend to use these shared ML packages.
Related papers
- Towards a Classification of Open-Source ML Models and Datasets for Software Engineering [52.257764273141184]
Open-source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks.
These resources lack a classification tailored to Software Engineering (SE) needs.
We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time.
arXiv Detail & Related papers (2024-11-14T18:52:05Z)
- On the Creation of Representative Samples of Software Repositories [1.8599311233727087]
With the emergence of social coding platforms such as GitHub, researchers have now access to millions of software repositories to use as source data for their studies.
Current sampling methods are often based on random selection or rely on variables which may not be related to the research study.
We present a methodology for creating representative samples of software repositories, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study.
arXiv Detail & Related papers (2024-10-01T12:41:15Z)
- A Large-Scale Study of Model Integration in ML-Enabled Software Systems [4.776073133338119]
Machine learning (ML) and its embedding in systems has drastically changed the engineering of software-intensive systems.
Traditionally, software engineering focuses on manually created artifacts such as source code and the process of creating them.
We present the first large-scale study of real ML-enabled software systems, covering over 2,928 open source systems on GitHub.
arXiv Detail & Related papers (2024-08-12T15:28:40Z)
- Wildest Dreams: Reproducible Research in Privacy-preserving Neural Network Training [2.853180143237022]
This work focuses on the ML model's training phase, where maintaining user data privacy is of utmost importance.
We provide a solid theoretical background that eases the understanding of current approaches.
We reproduce results for some of the papers and examine at what level existing works in the field provide support for open science.
arXiv Detail & Related papers (2024-03-06T10:25:36Z)
- DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows [72.40917624485822]
We introduce DataDreamer, an open source Python library that allows researchers to implement powerful large language models.
DataDreamer also helps researchers adhere to best practices that we propose to encourage open science.
arXiv Detail & Related papers (2024-02-16T00:10:26Z)
- Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation.
We present an extensive overview by categorizing these works in terms of various IE subtasks and techniques.
We empirically analyze the most advanced methods and discover the emerging trend of IE tasks with LLMs.
arXiv Detail & Related papers (2023-12-29T14:25:22Z)
- Machine Learning-Enabled Software and System Architecture Frameworks [48.87872564630711]
The stakeholders with data science and Machine Learning related concerns, such as data scientists and data engineers, are yet to be included in existing architecture frameworks.
We surveyed 61 subject matter experts from over 25 organizations in 10 countries.
arXiv Detail & Related papers (2023-08-09T21:54:34Z)
- The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML Products [24.142477108938856]
This study contributes a dataset of 262 open-source ML products for end users, identified among more than half a million ML-related projects on GitHub.
We find that the majority of the ML products in our sample represent more startup-style development than reported in past interview studies.
We report 21 findings, including limited involvement of data scientists in many open-source ML products.
arXiv Detail & Related papers (2023-08-08T15:19:13Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- A Survey of Machine Unlearning [56.017968863854186]
Recent regulations now require that, on request, private information about a user must be removed from computer systems.
ML models often 'remember' the old data.
Recent works on machine unlearning have not been able to completely solve the problem.
arXiv Detail & Related papers (2022-09-06T08:51:53Z)
- Enabling Un-/Semi-Supervised Machine Learning for MDSE of the Real-World CPS/IoT Applications [0.5156484100374059]
We propose a novel approach to support domain-specific Model-Driven Software Engineering (MDSE) for the real-world use-case scenarios of smart Cyber-Physical Systems (CPS) and the Internet of Things (IoT).
We argue that most of the data available for Artificial Intelligence (AI) in the real world is unlabeled. Hence, unsupervised and/or semi-supervised ML approaches are the practical choices.
Our proposed approach is fully implemented and integrated with an existing state-of-the-art MDSE tool to serve the CPS/IoT domain.
arXiv Detail & Related papers (2021-07-06T15:51:39Z)