A Data Source Dependency Analysis Framework for Large Scale Data Science
Projects
- URL: http://arxiv.org/abs/2212.07951v1
- Date: Thu, 15 Dec 2022 16:34:39 GMT
- Title: A Data Source Dependency Analysis Framework for Large Scale Data Science
Projects
- Authors: Laurent Bou\'e and Pratap Kunireddy and Pavle Suboti\'c
- Abstract summary: Data source dependency hell refers to the central role played by data and its unique quirks that often lead to unexpected failures of machine learning models.
We present an automated dependency mapping framework that allows MLOps engineers to monitor the whole dependency map of their models in a fast paced engineering environment.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Dependency hell is a well-known pain point in the development of large
software projects and machine learning (ML) code bases are not immune from it.
In fact, ML applications suffer from an additional form, namely, "data source
dependency hell". This term refers to the central role played by data and its
unique quirks that often lead to unexpected failures of ML models which cannot
be explained by code changes. In this paper, we present an automated dependency
mapping framework that allows MLOps engineers to monitor the whole dependency
map of their models in a fast paced engineering environment and thus mitigate
ahead of time the consequences of any data source changes (e.g., re-train
model, ignore data, set default data etc.). Our system is based on a unified
and generic approach, employing techniques from static analysis, from which
data sources can be identified reliably for any type of dependency on a wide
range of source languages and artefacts. The dependency mapping framework is
exposed as a REST web API where the only input is the path to the Git
repository hosting the code base. Currently used by MLOps engineers at
Microsoft, we expect such dependency map APIs to be adopted more widely by
MLOps engineers in the future.
Related papers
- Interpretable Multi-Source Data Fusion Through Latent Variable Gaussian Process [8.207427766052044]
The proposed approach is demonstrated on and analyzed through two mathematical and two materials science case studies.
It is observed that compared to using single-source and source unaware machine learning models, the proposed multi-source data fusion framework can provide better predictions for sparse-data problems.
arXiv Detail & Related papers (2024-02-06T16:54:59Z) - Adapting Large Language Models for Content Moderation: Pitfalls in Data
Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we introduce how to fine-tune a LLM model that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z) - Machine Learning-Enabled Software and System Architecture Frameworks [48.87872564630711]
The stakeholders with data science and Machine Learning related concerns, such as data scientists and data engineers, are yet to be included in existing architecture frameworks.
We surveyed 61 subject matter experts from over 25 organizations in 10 countries.
arXiv Detail & Related papers (2023-08-09T21:54:34Z) - A Preliminary Investigation of MLOps Practices in GitHub [10.190501703364234]
Machine learning applications have led to an increasing interest in MLOps.
We present an initial investigation of the MLOps practices implemented in a set of ML-enabled systems retrieved from GitHub.
arXiv Detail & Related papers (2022-09-23T07:29:56Z) - Demystifying Dependency Bugs in Deep Learning Stack [7.488059560714949]
This paper characterizes symptoms, root causes and fix patterns of dependency bugs (DBs) across the whole Deep Learning stack.
Our findings shed light on practical implications on dependency management.
arXiv Detail & Related papers (2022-07-21T07:56:03Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Deep Transfer Learning for Multi-source Entity Linkage via Domain
Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching.
AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage.
Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z) - Exploring the potential of flow-based programming for machine learning
deployment in comparison with service-oriented architectures [8.677012233188968]
We argue that part of the reason is infrastructure that was not designed for activities around data collection and analysis.
We propose to consider flow-based programming with data streams as an alternative to commonly used service-oriented architectures for building software applications.
arXiv Detail & Related papers (2021-08-09T15:06:02Z) - The Prevalence of Code Smells in Machine Learning projects [9.722159563454436]
static code analysis can be used to find potential defects in the source code, opportunities, and violations of common coding standards.
We gathered a dataset of 74 open-source Machine Learning projects, installed their dependencies and ran Pylint on them.
This resulted in a top 20 of all detected code smells, per category.
arXiv Detail & Related papers (2021-03-06T16:01:54Z) - Learning to Generalize Unseen Domains via Memory-based Multi-Source
Meta-Learning for Person Re-Identification [59.326456778057384]
We propose the Memory-based Multi-Source Meta-Learning framework to train a generalizable model for unseen domains.
We also present a meta batch normalization layer (MetaBN) to diversify meta-test features.
Experiments demonstrate that our M$3$L can effectively enhance the generalization ability of the model for unseen domains.
arXiv Detail & Related papers (2020-12-01T11:38:16Z) - KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT)
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.