Managed Geo-Distributed Feature Store: Architecture and System Design
- URL: http://arxiv.org/abs/2305.20077v1
- Date: Wed, 31 May 2023 17:51:30 GMT
- Title: Managed Geo-Distributed Feature Store: Architecture and System Design
- Authors: Anya Li, Bhala Ranganathan, Feng Pan, Mickey Zhang, Qianjun Xu, Runhan
Li, Sethu Raman, Shail Paragbhai Shah, Vivienne Tang (Microsoft)
- Abstract summary: Companies are using machine learning to solve real-world problems and are developing hundreds to thousands of features in the process.
Without feature stores, different teams across various business groups would maintain the above process independently.
This paper aims to capture the core architectural components that make up a managed feature store and to share the design learning in building such a system.
- Score: 1.1809647985607934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Companies are using machine learning to solve real-world problems and are
developing hundreds to thousands of features in the process. They are building
feature engineering pipelines as part of the MLOps life cycle to transform data
from various data sources and materialize the same for future consumption.
Without feature stores, different teams across various business groups would
maintain the above process independently, which can lead to conflicting and
duplicated features in the system. Data scientists find it hard to search for
and reuse existing features and it is painful to maintain version control.
Furthermore, feature correctness violations related to online (inferencing) -
offline (training) skews and data leakage are common. Although the machine
learning community has extensively discussed the need for feature stores and
their purpose, this paper aims to capture the core architectural components
that make up a managed feature store and to share the design learning in
building such a system.
Related papers
- Digital Asset Data Lakehouse. The concept based on a blockchain research center [0.0]
This paper introduces a novel software architecture designed to meet the demand for robust, scalable, and secure data management platforms.
We detail the architectural design, including its components and interactions, and discuss how it addresses common challenges in managing blockchain data and digital assets.
Our results indicate that the proposed architecture not only enhances the efficiency and scalability of distributed data management but also opens new avenues for innovation in the research area.
arXiv Detail & Related papers (2025-03-20T09:12:39Z)
- What is a Feature, Really? Toward a Unified Understanding Across SE Disciplines [0.7125007887148752]
In software engineering, the concept of a "feature" is inconsistently defined across disciplines such as requirements engineering (RE) and software product lines (SPL).
This paper proposes an empirical, data-driven approach to explore how features are described, implemented, and managed across real-world projects.
arXiv Detail & Related papers (2025-02-14T09:08:53Z)
- Stalactite: Toolbox for Fast Prototyping of Vertical Federated Learning Systems [37.11550251825938]
We present Stalactite, an open-source framework for Vertical Federated Learning (VFL) systems.
VFL is a type of FL where data samples are divided by features across several data owners.
We demonstrate its use on real-world recommendation datasets.
arXiv Detail & Related papers (2024-09-23T21:29:03Z)
- Empowering Private Tutoring by Chaining Large Language Models [87.76985829144834]
This work explores the development of a full-fledged intelligent tutoring system powered by state-of-the-art large language models (LLMs).
The system is divided into three inter-connected core processes: interaction, reflection, and reaction.
Each process is implemented by chaining LLM-powered tools along with dynamically updated memory modules.
arXiv Detail & Related papers (2023-09-15T02:42:03Z)
- Machine Learning-Enabled Software and System Architecture Frameworks [48.87872564630711]
The stakeholders with data science and Machine Learning related concerns, such as data scientists and data engineers, are yet to be included in existing architecture frameworks.
We surveyed 61 subject matter experts from over 25 organizations in 10 countries.
arXiv Detail & Related papers (2023-08-09T21:54:34Z)
- Scalable Collaborative Learning via Representation Sharing [53.047460465980144]
Federated Learning (FL) and Split Learning (SL) are two frameworks that enable collaborative learning while keeping the data private (on device).
In FL, each data holder trains a model locally and releases it to a central server for aggregation.
In SL, the clients must release individual cut-layer activations (smashed data) to the server and wait for its response (during both inference and back propagation).
In this work, we present a novel approach for privacy-preserving machine learning, where the clients collaborate via online knowledge distillation using a contrastive loss.
arXiv Detail & Related papers (2022-11-20T10:49:22Z)
- Applied Federated Learning: Architectural Design for Robust and Efficient Learning in Privacy Aware Settings [0.8454446648908585]
The classical machine learning paradigm requires the aggregation of user data in a central location.
Centralization of data poses risks, including a heightened risk of internal and external security incidents.
Federated learning with differential privacy is designed to avoid the server-side centralization pitfall.
arXiv Detail & Related papers (2022-06-02T00:30:04Z)
- Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z)
- ATEK: Augmenting Transformers with Expert Knowledge for Indoor Layout Synthesis [10.213825064088503]
We propose a method that combines expert knowledge, for example, knowledge about ergonomics, with a data-driven generator based on the popular Transformer architecture.
Using this knowledge, the synthesized layouts can be biased to exhibit desirable properties, even if these properties are not present in the dataset.
Our work aims to improve generative machine learning for modeling and provide novel tools for designers and amateurs for the problem of interior layout creation.
arXiv Detail & Related papers (2022-02-01T02:25:04Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- From Distributed Machine Learning to Federated Learning: A Survey [49.7569746460225]
Federated learning emerges as an efficient approach to exploit distributed data and computing resources.
We propose a functional architecture of federated learning systems and a taxonomy of related techniques.
We present the distributed training, data communication, and security of FL systems.
arXiv Detail & Related papers (2021-04-29T14:15:11Z)
- Federated Learning: A Signal Processing Perspective [144.63726413692876]
Federated learning is an emerging machine learning paradigm for training models across multiple edge devices holding local datasets, without explicitly exchanging the data.
This article provides a unified systematic framework for federated learning in a manner that encapsulates and highlights the main challenges that are natural to treat using signal processing tools.
arXiv Detail & Related papers (2021-03-31T15:14:39Z)
- Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs [0.2538209532048866]
This article provides the motivation and overview of the Collective Knowledge framework (CK or cKnowledge).
The CK concept is to decompose research projects into reusable components that encapsulate research artifacts.
The long-term goal is to accelerate innovation by connecting researchers and practitioners to share and reuse all their knowledge.
arXiv Detail & Related papers (2020-11-02T17:42:59Z)
- MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines [29.999324319722508]
We address two main challenges that arise during the deployment of machine learning pipelines through the design of versioning for an end-to-end analytics system, MLCask.
We define and accelerate the metric-driven merge operation by pruning the pipeline search tree using reusable history records and pipeline compatibility information.
The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases.
arXiv Detail & Related papers (2020-10-17T13:34:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.