Enabling collaborative data science development with the Ballet
framework
- URL: http://arxiv.org/abs/2012.07816v2
- Date: Tue, 6 Apr 2021 20:15:07 GMT
- Title: Enabling collaborative data science development with the Ballet
framework
- Authors: Micah J. Smith, J\"urgen Cito, Kelvin Lu, Kalyan Veeramachaneni
- Abstract summary: We present a novel conceptual framework and ML programming model to address challenges to scaling data science collaborations.
We instantiate these ideas in Ballet, a lightweight software framework for collaborative open-source data science.
- Score: 9.424574945499844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the open-source model for software development has led to successful
large-scale collaborations in building software systems, data science projects
are frequently developed by individuals or small groups. We describe challenges
to scaling data science collaborations and present a novel conceptual framework
and ML programming model to address them. We instantiate these ideas in Ballet,
a lightweight software framework for collaborative open-source data science and
a cloud-based development environment, with a plugin for collaborative feature
engineering. Using our framework, collaborators incrementally propose feature
definitions to a repository which are each subjected to an ML evaluation and
can be automatically merged into an executable feature engineering pipeline. We
leverage Ballet to conduct an extensive case study analysis of a real-world
income prediction problem, and discuss implications for collaborative projects.
Related papers
- Human-In-the-Loop Software Development Agents [12.830816751625829]
Large Language Models (LLMs) are introduced to automatically resolve software development tasks.
We introduce a Human-in-the-loop LLM-based Agents framework (HULA) for software development.
We design, implement, and deploy the HULA framework into Atlassian for internal uses.
arXiv Detail & Related papers (2024-11-19T23:22:33Z) - Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z) - DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle.
Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench.
Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z) - On the Interaction between Software Engineers and Data Scientists when
building Machine Learning-Enabled Systems [1.2184324428571227]
Machine Learning (ML) components have been increasingly integrated into the core systems of organizations.
One of the key challenges is the effective interaction between actors with different backgrounds who need to work closely together.
This paper presents an exploratory case study to understand the current interaction and collaboration dynamics between these roles in ML projects.
arXiv Detail & Related papers (2024-02-08T00:27:56Z) - SoTaNa: The Open-Source Software Development Assistant [81.86136560157266]
SoTaNa is an open-source software development assistant.
It generates high-quality instruction-based data for the domain of software engineering.
It employs a parameter-efficient fine-tuning approach to enhance the open-source foundation model, LLaMA.
arXiv Detail & Related papers (2023-08-25T14:56:21Z) - Code Recommendation for Open Source Software Developers [32.181023933552694]
CODER is a novel graph-based code recommendation framework for open source software developers.
Our framework achieves superior performance under various experimental settings, including intra-project, cross-project, and cold-start recommendation.
arXiv Detail & Related papers (2022-10-15T16:40:36Z) - Assessing the Quality of Computational Notebooks for a Frictionless
Transition from Exploration to Production [1.332560004325655]
Data scientists must transition from the explorative phase of Machine Learning projects to their production phase.
To narrow the gap between these two phases, tools and practices adopted by data scientists might be improved by incorporating consolidated software engineering solutions.
In my research project, I study the best practices for collaboration with computational notebooks and propose proof-of-concept tools to foster guidelines compliance.
arXiv Detail & Related papers (2022-05-24T10:13:38Z) - YMIR: A Rapid Data-centric Development Platform for Vision Applications [82.67319997259622]
This paper introduces an open source platform for rapid development of computer vision applications.
The platform puts the efficient data development at the center of the machine learning development process.
arXiv Detail & Related papers (2021-11-19T05:02:55Z) - Distributed Deep Learning in Open Collaborations [49.240611132653456]
We propose a novel algorithmic framework designed specifically for collaborative training.
We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost.
arXiv Detail & Related papers (2021-06-18T16:23:13Z) - A Data-Centric Framework for Composable NLP Workflows [109.51144493023533]
Empirical natural language processing systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components.
We establish a unified open-source framework to support fast development of such sophisticated NLP in a composable manner.
arXiv Detail & Related papers (2021-03-02T16:19:44Z) - Representation of Developer Expertise in Open Source Software [12.583969739954526]
We use the World of Code infrastructure to extract the complete set of APIs in the files changed by open source developers.
We then employ Doc2Vec embeddings for vector representations of APIs, developers, and projects.
We evaluate if these embeddings reflect the postulated topology of the Skill Space.
arXiv Detail & Related papers (2020-05-20T16:36:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.