Couler: Unified Machine Learning Workflow Optimization in Cloud
- URL: http://arxiv.org/abs/2403.07608v1
- Date: Tue, 12 Mar 2024 12:47:32 GMT
- Title: Couler: Unified Machine Learning Workflow Optimization in Cloud
- Authors: Xiaoda Wang, Yuan Tang, Tengda Guo, Bo Sang, Jingji Wu, Jian Sha, Ke Zhang, Jiang Qian, Mingjie Tang
- Abstract summary: Couler is a system designed for unified ML workflow optimization in the cloud.
We integrate Large Language Models (LLMs) into workflow generation, and provide a unified programming interface for various workflow engines.
Couler has successfully improved the CPU/Memory utilization by more than 15% and the workflow completion rate by around 17%.
- Score: 6.769259207650922
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine Learning (ML) has become ubiquitous, fueling data-driven applications
across various organizations. Contrary to the traditional perception of ML in
research, ML workflows can be complex, resource-intensive, and time-consuming.
Expanding an ML workflow to encompass a wider range of data infrastructure and
data types may lead to larger workloads and increased deployment costs.
Currently, numerous workflow engines are available (with over ten being widely
recognized). This variety poses a challenge for end-users in terms of mastering
different engine APIs. While efforts have primarily focused on optimizing ML
Operations (MLOps) for a specific workflow engine, current methods largely
overlook workflow optimization across different engines.
In this work, we design and implement Couler, a system designed for unified
ML workflow optimization in the cloud. Our main insight lies in the ability to
generate an ML workflow using natural language (NL) descriptions. We integrate
Large Language Models (LLMs) into workflow generation, and provide a unified
programming interface for various workflow engines. This approach alleviates
the need to understand various workflow engines' APIs. Moreover, Couler
enhances workflow computation efficiency by introducing automated caching at
multiple stages, enabling large workflow auto-parallelization and automatic
hyperparameters tuning. These enhancements minimize redundant computational
costs and improve fault tolerance during deep learning workflow training.
Couler is extensively deployed in real-world production scenarios at Ant Group,
handling approximately 22k workflows daily, and has successfully improved the
CPU/Memory utilization by more than 15% and the workflow completion rate by
around 17%.
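The unified interface the abstract describes has an open-source counterpart, the couler-proj/couler project on GitHub, which exposes it as a Python DSL. The sketch below is a minimal illustration based on that public API, not the paper's production system at Ant Group: workflow steps are written once against Couler's functions and bound to a concrete engine (here Argo Workflows) only at submission time, and `couler.map` fans a step out over inputs in the spirit of the auto-parallelization the abstract mentions. The step contents and parameter values are invented for illustration.

```python
# Minimal sketch of Couler's unified Python interface, following the
# public couler-proj/couler API; the internal system described in the
# paper may differ (assumption).
import couler.argo as couler
from couler.argo_submitter import ArgoSubmitter


def train(param):
    # A step is an ordinary container run; Couler compiles it into the
    # target engine's native workflow spec, so no engine-specific YAML
    # is written by hand.
    return couler.run_container(
        image="docker/whalesay:latest",
        command=["cowsay"],
        args=[param],
    )


# Fan the step out over a (hypothetical) hyperparameter grid; Couler
# expands this into parallel branches of the generated workflow.
couler.map(lambda p: train(p), ["lr=0.1", "lr=0.01", "lr=0.001"])

# The engine is chosen only at submission time; swapping in a different
# submitter retargets the same workflow definition to another backend.
couler.run(submitter=ArgoSubmitter())
```

Under this design, the natural-language workflow generation described above amounts to having an LLM emit code against this single API rather than against each engine's own spec format; the multi-stage caching and fault-tolerance machinery the abstract mentions is internal to the system and not shown here.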
Related papers
- AFlow: Automating Agentic Workflow Generation [36.61172223528231]
Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains.
We introduce AFlow, an automated framework that efficiently explores the space of possible workflows using Monte Carlo Tree Search.
Empirical evaluations across six benchmark datasets demonstrate AFlow's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2024-10-14T17:40:40Z)
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.
We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
- AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML [56.565200973244146]
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline.
Recent works have started exploiting large language models (LLMs) to lessen this burden.
This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML.
arXiv Detail & Related papers (2024-10-03T20:01:09Z)
- Agent Workflow Memory [71.81385627556398]
We introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines.
AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate.
Online AWM robustly generalizes in cross-task, website, and domain evaluations.
arXiv Detail & Related papers (2024-09-11T17:21:00Z)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
- AutoFlow: Automated Workflow Generation for Large Language Model Agents [39.72700864347576]
Large Language Models (LLMs) have shown significant progress in understanding complex natural language.
To make sure LLM Agents follow an effective and reliable procedure to solve the given task, manually designed workflows are usually used.
We propose AutoFlow, a framework designed to automatically generate workflows for agents to solve complex tasks.
arXiv Detail & Related papers (2024-07-01T21:05:02Z)
- CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs)
It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks.
Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning [40.09527159285327]
We build the first end-to-end and general-purpose system, called Walle, for device-cloud collaborative machine learning (ML).
Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment.
We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability.
arXiv Detail & Related papers (2022-05-30T03:43:35Z)
- Demystifying a Dark Art: Understanding Real-World Machine Learning Model Development [2.422369741135428]
We analyze over 475k user-generated workflows on OpenML, an open-source platform for tracking and sharing machine learning experiments.
We find that users often adopt a manual, automated, or mixed approach when iterating on their workflows.
arXiv Detail & Related papers (2020-05-04T14:33:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.