Couler: Unified Machine Learning Workflow Optimization in Cloud
- URL: http://arxiv.org/abs/2403.07608v1
- Date: Tue, 12 Mar 2024 12:47:32 GMT
- Title: Couler: Unified Machine Learning Workflow Optimization in Cloud
- Authors: Xiaoda Wang, Yuan Tang, Tengda Guo, Bo Sang, Jingji Wu, Jian Sha, Ke
Zhang, Jiang Qian, Mingjie Tang
- Abstract summary: Couler is a system designed for unified ML workflow optimization in the cloud.
We integrate Large Language Models (LLMs) into workflow generation, and provide a unified programming interface for various workflow engines.
Couler has successfully improved the CPU/Memory utilization by more than 15% and the workflow completion rate by around 17%.
- Score: 6.769259207650922
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine Learning (ML) has become ubiquitous, fueling data-driven applications
across various organizations. Contrary to the traditional perception of ML in
research, ML workflows can be complex, resource-intensive, and time-consuming.
Expanding an ML workflow to encompass a wider range of data infrastructure and
data types may lead to larger workloads and increased deployment costs.
Currently, numerous workflow engines are available (with over ten being widely
recognized). This variety poses a challenge for end-users in terms of mastering
different engine APIs. While efforts have primarily focused on optimizing ML
Operations (MLOps) for a specific workflow engine, current methods largely
overlook workflow optimization across different engines.
In this work, we design and implement Couler, a system designed for unified
ML workflow optimization in the cloud. Our main insight lies in the ability to
generate an ML workflow using natural language (NL) descriptions. We integrate
Large Language Models (LLMs) into workflow generation, and provide a unified
programming interface for various workflow engines. This approach alleviates
the need to understand various workflow engines' APIs. Moreover, Couler
enhances workflow computation efficiency by introducing automated caching at
multiple stages, enabling large workflow auto-parallelization and automatic
hyperparameters tuning. These enhancements minimize redundant computational
costs and improve fault tolerance during deep learning workflow training.
Couler is extensively deployed in real-world production scenarios at Ant Group,
handling approximately 22k workflows daily, and has successfully improved the
CPU/Memory utilization by more than 15% and the workflow completion rate by
around 17%.
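The abstract describes the unified programming interface only at a high level. Below is a minimal sketch of what composing an engine-agnostic two-step workflow could look like, loosely modeled on the open-source Couler Python SDK (couler-proj/couler); the module paths, function names, and the Argo submitter used here are assumptions drawn from that project rather than details stated in the paper.

```python
# Illustrative sketch only: loosely modeled on the open-source Couler
# Python SDK (couler-proj/couler). Names and behavior are assumptions,
# not details taken from this paper.
import couler.argo as couler
from couler.argo_submitter import ArgoSubmitter


def train():
    # A step is declared once, engine-agnostically, as a container run;
    # Couler translates it into the target engine's workflow spec.
    return couler.run_container(
        image="python:3.9",
        command=["python", "-c", "print('training model')"],
    )


def evaluate():
    return couler.run_container(
        image="python:3.9",
        command=["python", "-c", "print('evaluating model')"],
    )


# Steps invoked in sequence form a two-step workflow; the same Python
# definition can be handed to a backend submitter instead of writing
# engine-specific YAML or API calls.
train()
evaluate()
couler.run(submitter=ArgoSubmitter())
```

In the paper's setting, the natural-language front end would presumably emit a definition of this kind automatically, and the caching, auto-parallelization, and hyperparameter-tuning optimizations would then be applied to the resulting workflow graph before submission.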
Related papers
- Large Language Models for Constructing and Optimizing Machine Learning Workflows: A Survey [3.340984908213717]
Building effective machine learning (ML) workflows to address complex tasks is a primary focus of the Automated ML (AutoML) community.
Recently, the integration of Large Language Models (LLMs) into ML has shown great potential for automating and enhancing various stages of the ML pipeline.
arXiv Detail & Related papers (2024-11-11T21:54:26Z) - WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [105.46456444315693]
We present WorkflowLLM, a data-centric framework to enhance the capability of large language models in workflow orchestration.
It first constructs WorkflowBench, a large-scale fine-tuning dataset with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories.
WorkflowLlama demonstrates a strong capacity to orchestrate complex APIs, while also achieving notable generalization performance.
arXiv Detail & Related papers (2024-11-08T09:58:02Z) - AFlow: Automating Agentic Workflow Generation [36.61172223528231]
Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains.
We introduce AFlow, an automated framework that efficiently explores the workflow search space using Monte Carlo Tree Search.
Empirical evaluations across six benchmark datasets demonstrate AFlow's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2024-10-14T17:40:40Z) - Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.
We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z) - AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML [56.565200973244146]
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline.
Recent works have started exploiting large language models (LLM) to lessen such burden.
This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML.
arXiv Detail & Related papers (2024-10-03T20:01:09Z) - Agent Workflow Memory [71.81385627556398]
We introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows.
AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate.
Online AWM robustly generalizes in cross-task, website, and domain evaluations.
arXiv Detail & Related papers (2024-09-11T17:21:00Z) - Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z) - AutoFlow: Automated Workflow Generation for Large Language Model Agents [39.72700864347576]
Large Language Models (LLMs) have shown significant progress in understanding complex natural language.
To make sure LLM agents follow an effective and reliable procedure to solve the given task, manually designed workflows are usually used.
We propose AutoFlow, a framework designed to automatically generate workflows for agents to solve complex tasks.
arXiv Detail & Related papers (2024-07-01T21:05:02Z) - In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.