RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of
Language Models
- URL: http://arxiv.org/abs/2309.06619v1
- Date: Tue, 12 Sep 2023 22:22:10 GMT
- Title: RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of
Language Models
- Authors: Yufei Li, Zexin Li, Wei Yang, Cong Liu
- Abstract summary: varied inference latency, identified as a consequence of uncertainty intrinsic to the nature of language, can lead to computational inefficiency.
We present RT-LM, an uncertainty-aware resource management ecosystem for real-time inference of LMs.
We show that RT-LM can significantly reduce the average response time and improve throughput while incurring a rather small runtime overhead.
- Score: 12.947537874888717
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in language models (LMs) have gained substantial
attentions on their capability to generate human-like responses. Though
exhibiting a promising future for various applications such as conversation AI,
these LMs face deployment challenges on various devices due to their extreme
computational cost and unpredictable inference latency. Such varied inference
latency, identified as a consequence of uncertainty intrinsic to the nature of
language, can lead to computational inefficiency and degrade the overall
performance of LMs, especially under high-traffic workloads. Unfortunately, the
bandwidth of these uncertainty sources is extensive, complicating the
prediction of latency and the effects emanating from such uncertainties. To
understand and mitigate the impact of uncertainty on real-time
response-demanding systems, we take the first step to comprehend, quantify and
optimize these uncertainty-induced latency performance variations in LMs.
Specifically, we present RT-LM, an uncertainty-aware resource management
ecosystem for real-time inference of LMs. RT-LM innovatively quantifies how
specific input uncertainties, adversely affect latency, often leading to an
increased output length. Exploiting these insights, we devise a lightweight yet
effective method to dynamically correlate input text uncertainties with output
length at runtime. Utilizing this quantification as a latency heuristic, we
integrate the uncertainty information into a system-level scheduler which
explores several uncertainty-induced optimization opportunities, including
uncertainty-aware prioritization, dynamic consolidation, and strategic CPU
offloading. Quantitative experiments across five state-of-the-art LMs on two
hardware platforms demonstrates that RT-LM can significantly reduce the average
response time and improve throughput while incurring a rather small runtime
overhead.
Related papers
- Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment [56.87031484108484]
Large Language Models (LLMs) are increasingly recognized for their practical applications.
Retrieval-Augmented Generation (RAG) tackles this challenge and has shown a significant impact on LLMs.
By minimizing retrieval requests that yield neutral or harmful results, we can effectively reduce both time and computational costs.
arXiv Detail & Related papers (2024-11-09T15:12:28Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned [36.216938133315786]
We introduce a reliable and high-speed production system aimed at detecting and rectifying the hallucination issue within large language models (LLMs)
Our system encompasses named entity recognition (NER), natural language inference (NLI), span-based detection (SBD)
We detail the core elements of our framework and underscore the paramount challenges tied to response time, availability, and performance metrics.
arXiv Detail & Related papers (2024-07-22T07:48:30Z) - Future Aware Safe Active Learning of Time Varying Systems using Gaussian Processes [8.678546901075984]
This paper introduces a safe active learning framework tailored for time-varying systems.
The proposed Time-aware Integrated Mean Squared Prediction Error (T-IMSPE) method minimizes posterior variance over current and future states.
arXiv Detail & Related papers (2024-05-17T07:09:52Z) - Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization [20.631476379056892]
Large Language Models (LLMs) are at the forefront of this movement.
LLMs require cloud hosting, which raises issues regarding privacy, latency, and usage limitations.
We present an edge intelligence optimization problem tailored for LLM inference.
arXiv Detail & Related papers (2024-05-12T02:38:58Z) - Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples [63.9198662100875]
In this paper, we aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation.
We find that high energy-latency cost can be manipulated by maximizing the length of generated sequences.
Experiments demonstrate that our verbose samples can largely extend the length of generated sequences.
arXiv Detail & Related papers (2024-04-25T12:11:38Z) - Forecasting Long-Time Dynamics in Quantum Many-Body Systems by Dynamic Mode Decomposition [6.381013699474244]
We propose a method that utilizes reliable short-time data of physical quantities to accurately forecast long-time behavior.
The method is based on the dynamic mode decomposition (DMD), which is commonly used in fluid dynamics.
It is demonstrated that the present method enables accurate forecasts at time as long as nearly an order of magnitude longer than that of the short-time training data.
arXiv Detail & Related papers (2024-03-29T03:10:34Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF)
arXiv Detail & Related papers (2024-02-08T07:21:45Z) - Thrust: Adaptively Propels Large Language Models with External Knowledge [58.72867916604562]
Large-scale pre-trained language models (PTLMs) are shown to encode rich knowledge in their model parameters.
The inherent knowledge in PTLMs can be opaque or static, making external knowledge necessary.
We propose the instance-level adaptive propulsion of external knowledge (IAPEK), where we only conduct the retrieval when necessary.
arXiv Detail & Related papers (2023-07-19T20:16:46Z) - Effective Multi-User Delay-Constrained Scheduling with Deep Recurrent
Reinforcement Learning [28.35473469490186]
Multi-user delay constrained scheduling is important in many real-world applications including wireless communication, live streaming, and cloud computing.
We propose a deep reinforcement learning (DRL) algorithm, named Recurrent Softmax Delayed Deep Double Deterministic Policy Gradient ($mathttRSD4$)
$mathttRSD4$ guarantees resource and delay constraints by Lagrangian dual and delay-sensitive queues, respectively.
It also efficiently tackles partial observability with a memory mechanism enabled by the recurrent neural network (RNN) and introduces user-level decomposition and node-level
arXiv Detail & Related papers (2022-08-30T08:44:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.