MLOS: An Infrastructure for Automated Software Performance Engineering
- URL: http://arxiv.org/abs/2006.02155v2
- Date: Thu, 4 Jun 2020 11:10:53 GMT
- Title: MLOS: An Infrastructure for Automated Software Performance Engineering
- Authors: Carlo Curino, Neha Godwal, Brian Kroth, Sergiy Kuryata, Greg Lapinski,
Siqi Liu, Slava Oks, Olga Poppe, Adam Smiechowski, Ed Thayer, Markus Weimer,
Yiwen Zhu
- Abstract summary: We present MLOS, an ML-powered infrastructure and methodology to democratize Software Performance Engineering.
MLOS enables continuous, instance-level, robust, and trackable systems optimization.
We are in the process of open-sourcing the MLOS core infrastructure, and we are engaging with academic institutions to create an educational program around Software 2.0 and MLOS ideas.
- Score: 14.244308246225744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing modern systems software is a complex task that combines business
logic programming and Software Performance Engineering (SPE). The latter is an
experimental and labor-intensive activity focused on optimizing the system for
a given hardware, software, and workload (hw/sw/wl) context.
Today's SPE is performed during build/release phases by specialized teams,
and cursed by: 1) lack of standardized and automated tools, 2) significant
repeated work as hw/sw/wl context changes, 3) fragility induced by a
"one-size-fit-all" tuning (where improvements on one workload or component may
impact others). The net result: despite costly investments, system software is
often outside its optimal operating point - anecdotally leaving 30% to 40% of
performance on the table.
The recent developments in Data Science (DS) hint at an opportunity:
combining DS tooling and methodologies with a new developer experience to
transform the practice of SPE. In this paper we present: MLOS, an ML-powered
infrastructure and methodology to democratize and automate Software Performance
Engineering. MLOS enables continuous, instance-level, robust, and trackable
systems optimization. MLOS is being developed and employed within Microsoft to
optimize SQL Server performance. Early results indicated that component-level
optimizations can lead to 20%-90% improvements when custom-tuning for a
specific hw/sw/wl, hinting at a significant opportunity. However, several
research challenges remain that will require community involvement. To this
end, we are in the process of open-sourcing the MLOS core infrastructure, and
we are engaging with academic institutions to create an educational program
around Software 2.0 and MLOS ideas.
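For illustration only: the abstract describes MLOS as a data-driven loop that benchmarks a component under a concrete hw/sw/wl context and searches its tunable parameters for a better operating point. The sketch below is not the MLOS API; the buffer-size knob and the run_benchmark stub are hypothetical stand-ins for a real component and workload measurement, and the search is a plain grid sweep rather than whatever optimizer MLOS actually uses.

```python
# Minimal, hypothetical sketch (plain Python, NOT the MLOS API): sweep a single
# component-level knob for one hw/sw/wl context by repeatedly measuring a
# benchmark and keeping the best-performing configuration.
import random
import statistics


def run_benchmark(buffer_size_kb: int) -> float:
    """Hypothetical stand-in for running the real component under the target
    workload and returning a cost metric (e.g., mean query latency)."""
    # Synthetic, noisy cost curve whose optimum happens to sit near 256 KB.
    return abs(buffer_size_kb - 256) / 256 + max(random.gauss(0.0, 0.02), 0.0)


def tune(candidates: list[int], samples: int = 5) -> tuple[int, float]:
    """Return the candidate configuration with the lowest mean measured cost."""
    best_cfg, best_cost = candidates[0], float("inf")
    for cfg in candidates:
        cost = statistics.mean(run_benchmark(cfg) for _ in range(samples))
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


if __name__ == "__main__":
    cfg, cost = tune([64, 128, 256, 512, 1024])
    print(f"best buffer_size_kb={cfg}, mean cost={cost:.3f}")
```

A real MLOS-style setup would replace the toy sweep with a smarter optimizer, repeat the loop continuously per instance, and track results across hw/sw/wl contexts, which this sketch omits.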
Related papers
- LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [54.954211216847135]
Existing large language models (LLMs) only reach a correctness rate in the range of 30% to 60%.
We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE).
STE orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory.
arXiv Detail & Related papers (2024-03-07T18:50:51Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs).
It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks.
Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
- Towards an MLOps Architecture for XAI in Industrial Applications [2.0457031151514977]
Machine learning (ML) has become a popular tool in the industrial sector as it helps to improve operations, increase efficiency, and reduce costs.
One of the remaining Machine Learning Operations (MLOps) challenges is the need for explanations.
We developed a novel MLOps software architecture to address the challenge of integrating explanations and feedback capabilities into the ML development and deployment processes.
arXiv Detail & Related papers (2023-09-22T09:56:25Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- Reasonable Scale Machine Learning with Open-Source Metaflow [2.637746074346334]
We argue that re-purposing existing tools won't solve the current productivity issues.
We introduce Metaflow, an open-source framework for ML projects explicitly designed to boost the productivity of data practitioners.
arXiv Detail & Related papers (2023-03-21T11:28:09Z)
- Operationalizing Machine Learning: An Interview Study [13.300075655862573]
We conduct semi-structured interviews with 18 machine learning engineers (MLEs) working across many applications.
Our interviews expose three variables that govern success for a production ML deployment: Velocity, Validation, and Versioning.
We summarize common practices for successful ML experimentation, deployment, and sustaining production performance.
arXiv Detail & Related papers (2022-09-16T16:59:36Z)
- Exploring the potential of flow-based programming for machine learning deployment in comparison with service-oriented architectures [8.677012233188968]
We argue that part of the reason is infrastructure that was not designed for activities around data collection and analysis.
We propose to consider flow-based programming with data streams as an alternative to commonly used service-oriented architectures for building software applications.
arXiv Detail & Related papers (2021-08-09T15:06:02Z)
- Characterizing and Detecting Mismatch in Machine-Learning-Enabled Systems [1.4695979686066065]
Development and deployment of machine learning systems remains a challenge.
In this paper, we report our findings and their implications for improving end-to-end ML-enabled system development.
arXiv Detail & Related papers (2021-03-25T19:40:29Z)
- Technology Readiness Levels for Machine Learning Systems [107.56979560568232]
Development and deployment of machine learning systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end.
We have developed a proven systems engineering approach for machine learning development and deployment.
Our "Machine Learning Technology Readiness Levels" framework defines a principled process to ensure robust, reliable, and responsible systems.
arXiv Detail & Related papers (2021-01-11T15:54:48Z)
- Technology Readiness Levels for AI & ML [79.22051549519989]
Development of machine learning systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end.
Engineering systems follow well-defined processes and testing standards to streamline development for high-quality, reliable results.
We propose a proven systems engineering approach for machine learning development and deployment.
arXiv Detail & Related papers (2020-06-21T17:14:34Z)