The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML Products
- URL: http://arxiv.org/abs/2308.04328v2
- Date: Thu, 15 Aug 2024 04:43:35 GMT
- Title: The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML Products
- Authors: Nadia Nahar, Haoran Zhang, Grace Lewis, Shurui Zhou, Christian Kästner
- Abstract summary: This study contributes a dataset of 262 open-source ML products for end users, identified among more than half a million ML-related projects on GitHub.
We find that the majority of the ML products in our sample represent more startup-style development than reported in past interview studies.
We report 21 findings, including limited involvement of data scientists in many open-source ML products.
- Score: 24.142477108938856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) components are increasingly incorporated into software products for end-users, but developers face challenges in transitioning from ML prototypes to products. Academics have limited access to the source of commercial ML products, hindering research progress to address these challenges. In this study, first and foremost, we contribute a dataset of 262 open-source ML products for end users (not just models), identified among more than half a million ML-related projects on GitHub. Then, we qualitatively and quantitatively analyze 30 open-source ML products to answer six broad research questions about development practices and system architecture. We find that the majority of the ML products in our sample represent more startup-style development than reported in past interview studies. We report 21 findings, including limited involvement of data scientists in many open-source ML products, unusually low modularity between ML and non-ML code, diverse architectural choices on incorporating models into products, and limited prevalence of industry best practices such as model testing, pipeline automation, and monitoring. Additionally, we discuss seven implications of this study on research, development, and education, including the need for tools to assist teams without data scientists, education opportunities, and open-source-specific research for privacy-preserving telemetry.
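For readers who want a feel for the repository-mining step described above (identifying ML-related projects on GitHub before filtering down to end-user products), the sketch below queries GitHub's public search API for ML-related repositories. It is a minimal illustration under assumed query terms and an assumed star threshold, not the authors' actual selection pipeline or filtering criteria.

```python
# Minimal sketch: finding ML-related repositories via the GitHub search API.
# The query string and star threshold are illustrative assumptions, not the
# selection criteria used in the study.
import requests

GITHUB_API = "https://api.github.com/search/repositories"

def search_ml_repositories(query="machine-learning", min_stars=10, per_page=50):
    """Return basic metadata for repositories matching an ML-related search query."""
    params = {
        "q": f"{query} stars:>={min_stars}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
    }
    response = requests.get(GITHUB_API, params=params, timeout=30)
    response.raise_for_status()
    return [
        {"full_name": repo["full_name"], "stars": repo["stargazers_count"]}
        for repo in response.json().get("items", [])
    ]

if __name__ == "__main__":
    # Unauthenticated requests are rate-limited; larger crawls need an API token.
    for repo in search_ml_repositories()[:10]:
        print(f"{repo['full_name']}: {repo['stars']} stars")
```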
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks, and agent systems.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- LLM-PBE: Assessing Data Privacy in Large Language Models [111.58198436835036]
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis.
Despite the critical nature of this issue, no existing literature offers a comprehensive assessment of data privacy risks in LLMs.
Our paper introduces LLM-PBE, a toolkit crafted specifically for the systematic evaluation of data privacy risks in LLMs.
arXiv Detail & Related papers (2024-08-23T01:37:29Z)
- A Large-Scale Study of Model Integration in ML-Enabled Software Systems [4.776073133338119]
Machine learning (ML) and its embedding in systems have drastically changed the engineering of software-intensive systems.
Traditionally, software engineering has focused on manually created artifacts such as source code and the process of creating them.
We present the first large-scale study of real ML-enabled software systems, covering 2,928 open-source systems on GitHub.
arXiv Detail & Related papers (2024-08-12T15:28:40Z)
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective [53.48484062444108]
We find that the development of models and the development of data are not two separate paths but are interconnected.
On the one hand, vaster and higher-quality data contribute to better performance of MLLMs; on the other hand, MLLMs can facilitate the development of data.
To promote data-model co-development for the MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective.
arXiv Detail & Related papers (2024-07-11T15:08:11Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emergent in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a rising research hotspot.
This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLMs and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- Machine Learning for Software Engineering: A Tertiary Study [13.832268599253412]
Machine learning (ML) techniques increase the effectiveness of software engineering (SE) lifecycle activities.
We systematically collected, quality-assessed, summarized, and categorized 83 reviews in ML for SE published between 2009 and 2022, covering 6,117 primary studies.
The SE areas most tackled with ML are software quality and testing, while human-centered areas appear more challenging for ML.
arXiv Detail & Related papers (2022-11-17T09:19:53Z)
- Machine Learning Operations (MLOps): Overview, Definition, and Architecture [0.0]
The paradigm of Machine Learning Operations (MLOps) addresses the challenge of operationalizing ML models and bringing them into production.
However, MLOps is still a vague term, and its consequences for researchers and professionals are ambiguous.
We provide an aggregated overview of the necessary components and roles, as well as the associated architecture and principles.
arXiv Detail & Related papers (2022-05-04T19:38:48Z)
- Widening Access to Applied Machine Learning with TinyML [1.1678513163359947]
We describe our pedagogical approach to increasing access to applied machine learning (ML) through a massive open online course (MOOC) on Tiny Machine Learning (TinyML).
To this end, a collaboration between academia (Harvard University) and industry (Google) produced a four-part MOOC that provides application-oriented instruction on how to develop solutions using TinyML.
The series is openly available on the edX MOOC platform, has no prerequisites beyond basic programming, and is designed for learners from a global variety of backgrounds.
arXiv Detail & Related papers (2021-06-07T23:31:47Z)
- Empirical Study on the Software Engineering Practices in Open Source ML Package Repositories [6.2894222252929985]
Modern machine learning (ML) models require considerable technical expertise and resources to develop, train, and deploy.
The discovery and reuse of such models by practitioners and researchers are being addressed by public ML package repositories.
This paper conducts an exploratory study that analyzes the structure and contents of two popular ML package repositories.
arXiv Detail & Related papers (2020-12-02T18:52:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.