Related papers: Teola: Towards End-to-End Optimization of LLM-based Applications

URL: http://arxiv.org/abs/2407.00326v1
Date: Sat, 29 Jun 2024 05:59:53 GMT
Title: Teola: Towards End-to-End Optimization of LLM-based Applications
Authors: Xin Tan, Yimin Jiang, Yitao Yang, Hong Xu
Abstract summary: Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph.
Score: 13.478509565946354
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications.
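
The contrast the abstract draws between module-level and primitive-level orchestration can be illustrated with a small sketch. This is not Teola's implementation: the primitive names, module assignments, latencies, and dependency edges below are hypothetical, and the sketch assumes every primitive can start as soon as its predecessors finish (no resource contention).

```python
from graphlib import TopologicalSorter

# Hypothetical primitives for a retrieval-augmented query:
# name -> (owning module, latency in ms). All values are made up.
PRIMITIVES = {
    "embed_query":   ("retrieval", 10),
    "vector_search": ("retrieval", 30),
    "prefill_instr": ("llm", 40),   # prefill of the static instruction prompt
    "prefill_ctx":   ("llm", 20),   # prefill of the retrieved context
    "decode":        ("llm", 100),
}

# Hypothetical primitive-level dataflow graph: node -> set of predecessors.
DEPS = {
    "embed_query": set(),
    "vector_search": {"embed_query"},
    "prefill_instr": set(),  # independent of retrieval, so it can overlap with it
    "prefill_ctx": {"vector_search", "prefill_instr"},
    "decode": {"prefill_ctx"},
}


def module_level_latency(primitives):
    """Coarse-grained view: modules run back to back, so latencies add up."""
    return sum(latency for _, latency in primitives.values())


def primitive_level_latency(primitives, deps):
    """Fine-grained view: a primitive starts once all predecessors finish."""
    finish = {}
    for node in TopologicalSorter(deps).static_order():
        start = max((finish[p] for p in deps[node]), default=0)
        finish[node] = start + primitives[node][1]
    return max(finish.values())


if __name__ == "__main__":
    print(module_level_latency(PRIMITIVES))           # 200
    print(primitive_level_latency(PRIMITIVES, DEPS))  # 160
```

In this toy graph the instruction prefill overlaps with retrieval, so the critical path is shorter than the sum of module latencies. The numbers say nothing about the speedups reported in the paper.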

Related papers

Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation [23.318601470116498]
We aim to improve the offline end-to-end inference efficiency of multi-LLM applications in a single-node multi-GPU environment. We propose a sampling-then-simulation method to estimate the model running time. Experiments on 3 applications and a mixed application show that SamuLLM can achieve 1.0-2.4x end-to-end speedups.
arXiv Detail & Related papers (2025-03-21T06:56:35Z)
Optimizing Model Selection for Compound AI Systems [76.69936664916061]
We propose an efficient framework for model selection in compound systems. It iteratively selects one module and allocates to it the model with the highest module-wise performance. It confers 5%-70% accuracy gains compared to using the same LLM for all modules.
arXiv Detail & Related papers (2025-02-20T18:36:25Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE. Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
HyperDPO: Hypernetwork-based Multi-Objective Fine-Tuning Framework [11.342075103251576]
HyperDPO is a hypernetwork-based approach that extends the Direct Preference Optimization (DPO) technique. By substituting the Bradley-Terry-Luce model in DPO with the Plackett-Luce model, our framework is capable of handling a wide range of multi-objective fine-tuning (MOFT) tasks.
arXiv Detail & Related papers (2024-10-10T19:06:39Z)
Optimizing Token Usage on Large Language Model Conversations Using the Design Structure Matrix [49.1574468325115]
Large Language Models have become ubiquitous in many sectors and tasks. There is a need to reduce token usage, overcoming challenges such as short context windows, limited output sizes, and costs associated with token intake and generation. This work brings the Design Structure Matrix from the engineering design discipline into LLM conversation optimization.
arXiv Detail & Related papers (2024-10-01T14:38:36Z)
ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs. We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks. To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z)
ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency [20.33467627548677]
Large language models (LLMs) have surged in popularity and are extensively used in commercial applications. We conduct a detailed analysis to identify major bottlenecks that impact end-to-end latency in LLM serving systems. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving.
arXiv Detail & Related papers (2024-07-23T23:37:29Z)
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable [11.894203842968745]
Parrot is a service system that focuses on the end-to-end experience of LLM-based applications. A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests.
arXiv Detail & Related papers (2024-05-30T09:46:36Z)
LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning [69.95292905263393]
We show that gradient-based optimization and high-level LLM-based optimization are complementary to each other and can effectively collaborate in a combined optimization framework.
arXiv Detail & Related papers (2024-05-30T06:24:14Z)
Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs. We propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer. This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
arXiv Detail & Related papers (2024-01-14T16:17:07Z)
Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models [26.2566707495948]
Large Language Models (LLMs) have seen great advances in both academia and industry. We benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs of different sizes. Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs.
arXiv Detail & Related papers (2023-11-07T03:25:56Z)
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs). It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses the challenges of federated fine-tuning of LLMs, and introduces our package FS-LLM as a main contribution. We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios. We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
Low-code LLM: Graphical User Interface over Large Language Models [115.08718239772107]
This paper introduces a novel human-LLM interaction framework, Low-code LLM. It incorporates six types of simple low-code visual programming interactions to achieve more controllable and stable responses. We highlight three advantages of the low-code LLM: user-friendly interaction, controllable generation, and wide applicability.
arXiv Detail & Related papers (2023-04-17T09:27:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.