A Data Source Dependency Analysis Framework for Large Scale Data Science
Projects
- URL: http://arxiv.org/abs/2212.07951v1
- Date: Thu, 15 Dec 2022 16:34:39 GMT
- Authors: Laurent Boué, Pratap Kunireddy, and Pavle Subotić
- Abstract summary: Data source dependency hell refers to the central role played by data and its unique quirks, which often lead to unexpected failures of machine learning models.
We present an automated dependency mapping framework that allows MLOps engineers to monitor the whole dependency map of their models in a fast-paced engineering environment.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Dependency hell is a well-known pain point in the development of large
software projects and machine learning (ML) code bases are not immune from it.
In fact, ML applications suffer from an additional form, namely, "data source
dependency hell". This term refers to the central role played by data and its
unique quirks that often lead to unexpected failures of ML models which cannot
be explained by code changes. In this paper, we present an automated dependency
mapping framework that allows MLOps engineers to monitor the whole dependency
map of their models in a fast-paced engineering environment and thus mitigate,
ahead of time, the consequences of any data source changes (e.g., re-train the
model, ignore the data, set default data, etc.). Our system is based on a unified
and generic approach, employing techniques from static analysis, from which
data sources can be identified reliably for any type of dependency on a wide
range of source languages and artefacts. The dependency mapping framework is
exposed as a REST web API where the only input is the path to the Git
repository hosting the code base. The framework is currently used by MLOps
engineers at Microsoft, and we expect such dependency map APIs to be adopted
more widely by MLOps engineers in the future.
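The abstract describes the approach only at a high level (static analysis that identifies data sources from source code). As a rough illustration of the idea, and not the paper's actual implementation, the sketch below uses Python's `ast` module to extract string literals passed to common data-loading calls; the call names and the example snippet are purely illustrative assumptions.

```python
import ast

# Calls whose first string argument typically names a data source.
# This catalogue is an illustrative assumption, not the paper's.
DATA_SOURCE_CALLS = {"read_csv", "read_parquet", "read_sql", "open"}

def find_data_sources(source: str) -> list[str]:
    """Return string literals passed as the first argument to known
    data-loading calls anywhere in the given Python source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Handle both attribute calls (pd.read_csv) and bare names (open).
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
            if name in DATA_SOURCE_CALLS and node.args:
                first = node.args[0]
                if isinstance(first, ast.Constant) and isinstance(first.value, str):
                    found.append(first.value)
    return found

snippet = """
import pandas as pd
df = pd.read_csv("s3://bucket/sales.csv")
events = pd.read_parquet("hdfs://cluster/events")
"""
print(find_data_sources(snippet))
```

A production system like the one described would of course need to resolve dynamically constructed paths, cover many source languages and artefact types, and walk an entire Git repository; this sketch only shows the core extraction step on one file's AST.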
Related papers
- Interpretable Multi-Source Data Fusion Through Latent Variable Gaussian Process [8.207427766052044]
The proposed approach is demonstrated on and analyzed through two mathematical (representative parabola problem, 2D Ackley function) and two materials science (design of FeCrAl and SmCoFe alloys) case studies.
It is observed that compared to using single-source and source unaware ML models, the proposed multi-source data fusion framework can provide better predictions for sparse-data problems, interpretability regarding the sources, and enhanced modeling capabilities by taking advantage of the correlations and relationships among different sources.
arXiv Detail & Related papers (2024-02-06T16:54:59Z)
- Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we introduce how to fine-tune an LLM that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z)
- Machine Learning-Enabled Software and System Architecture Frameworks [48.87872564630711]
The stakeholders with data science and Machine Learning related concerns, such as data scientists and data engineers, are yet to be included in existing architecture frameworks.
We surveyed 61 subject matter experts from over 25 organizations in 10 countries.
arXiv Detail & Related papers (2023-08-09T21:54:34Z)
- A Preliminary Investigation of MLOps Practices in GitHub [10.190501703364234]
Machine learning applications have led to an increasing interest in MLOps.
We present an initial investigation of the MLOps practices implemented in a set of ML-enabled systems retrieved from GitHub.
arXiv Detail & Related papers (2022-09-23T07:29:56Z)
- Demystifying Dependency Bugs in Deep Learning Stack [7.488059560714949]
This paper characterizes symptoms, root causes and fix patterns of dependency bugs (DBs) across the whole Deep Learning stack.
Our findings shed light on the practical implications for dependency management.
arXiv Detail & Related papers (2022-07-21T07:56:03Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the procedures and pipelines needed for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching.
AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage.
Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z)
- Exploring the potential of flow-based programming for machine learning deployment in comparison with service-oriented architectures [8.677012233188968]
We argue that part of the reason is infrastructure that was not designed for activities around data collection and analysis.
We propose to consider flow-based programming with data streams as an alternative to commonly used service-oriented architectures for building software applications.
arXiv Detail & Related papers (2021-08-09T15:06:02Z)
- The Prevalence of Code Smells in Machine Learning projects [9.722159563454436]
Static code analysis can be used to find potential defects in the source code, opportunities for improvement, and violations of common coding standards.
We gathered a dataset of 74 open-source Machine Learning projects, installed their dependencies and ran Pylint on them.
This resulted in a top-20 list of all detected code smells, per category.
arXiv Detail & Related papers (2021-03-06T16:01:54Z)
- Learning to Generalize Unseen Domains via Memory-based Multi-Source Meta-Learning for Person Re-Identification [59.326456778057384]
We propose the Memory-based Multi-Source Meta-Learning framework to train a generalizable model for unseen domains.
We also present a meta batch normalization layer (MetaBN) to diversify meta-test features.
Experiments demonstrate that our M$^3$L can effectively enhance the generalization ability of the model for unseen domains.
arXiv Detail & Related papers (2020-12-01T11:38:16Z)
- KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.