RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of
Language Models
- URL: http://arxiv.org/abs/2309.06619v1
- Date: Tue, 12 Sep 2023 22:22:10 GMT
- Title: RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of
Language Models
- Authors: Yufei Li, Zexin Li, Wei Yang, Cong Liu
- Abstract summary: Varied inference latency, identified as a consequence of uncertainty intrinsic to the nature of language, can lead to computational inefficiency.
We present RT-LM, an uncertainty-aware resource management ecosystem for real-time inference of LMs.
We show that RT-LM can significantly reduce the average response time and improve throughput while incurring a rather small runtime overhead.
- Score: 12.947537874888717
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in language models (LMs) have gained substantial
attention for their capability to generate human-like responses. Though
exhibiting a promising future for various applications such as conversational AI,
these LMs face deployment challenges on various devices due to their extreme
computational cost and unpredictable inference latency. Such varied inference
latency, identified as a consequence of uncertainty intrinsic to the nature of
language, can lead to computational inefficiency and degrade the overall
performance of LMs, especially under high-traffic workloads. Unfortunately, the
range of these uncertainty sources is broad, which complicates predicting latency
and the effects that emanate from such uncertainties. To
understand and mitigate the impact of uncertainty on real-time
response-demanding systems, we take the first step to comprehend, quantify and
optimize these uncertainty-induced latency performance variations in LMs.
Specifically, we present RT-LM, an uncertainty-aware resource management
ecosystem for real-time inference of LMs. RT-LM innovatively quantifies how
specific input uncertainties adversely affect latency, often leading to an
increased output length. Exploiting these insights, we devise a lightweight yet
effective method to dynamically correlate input text uncertainties with output
length at runtime. Utilizing this quantification as a latency heuristic, we
integrate the uncertainty information into a system-level scheduler which
explores several uncertainty-induced optimization opportunities, including
uncertainty-aware prioritization, dynamic consolidation, and strategic CPU
offloading. Quantitative experiments across five state-of-the-art LMs on two
hardware platforms demonstrate that RT-LM can significantly reduce the average
response time and improve throughput while incurring a rather small runtime
overhead.
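
The abstract outlines the core mechanism: estimate how much output a prompt is likely to produce from its input-uncertainty signals, then use that estimate as a latency heuristic for scheduling. Below is a minimal, hedged sketch of what uncertainty-aware prioritization of this kind could look like; the feature set, the linear length predictor and its coefficients, and the Request/scheduler structure are illustrative assumptions for this page, not RT-LM's actual implementation, and dynamic consolidation and CPU offloading are not shown.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Lower priority value = served earlier (shortest-predicted-job-first).
    priority: float
    prompt: str = field(compare=False)


def uncertainty_features(prompt: str) -> dict:
    """Toy input-uncertainty signals; a real system would derive richer features."""
    tokens = prompt.split()
    return {
        "prompt_len": len(tokens),
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),
        "question_marks": prompt.count("?"),
    }


def predict_output_length(feat: dict) -> float:
    """Hypothetical lightweight predictor mapping uncertainty features to an
    expected output length; the coefficients are made up for illustration."""
    return (8.0
            + 0.6 * feat["prompt_len"]
            + 40.0 * feat["type_token_ratio"]
            + 15.0 * feat["question_marks"])


class UncertaintyAwareScheduler:
    """Priority queue that serves the request with the shortest predicted
    output first, so high-uncertainty (long) requests do not block others."""

    def __init__(self):
        self._queue = []

    def submit(self, prompt: str) -> None:
        est_len = predict_output_length(uncertainty_features(prompt))
        heapq.heappush(self._queue, Request(priority=est_len, prompt=prompt))

    def next_request(self):
        return heapq.heappop(self._queue) if self._queue else None


if __name__ == "__main__":
    sched = UncertaintyAwareScheduler()
    sched.submit("Summarize this paragraph in one sentence: ...")
    sched.submit("Why? Explain in detail, covering every edge case you can think of?")
    while (req := sched.next_request()) is not None:
        print(f"serving (predicted length {req.priority:.0f} tokens): {req.prompt[:40]}")
```

The design choice illustrated here is shortest-predicted-job-first ordering: if predicted output length tracks actual decoding time reasonably well, prioritizing short requests reduces average response time under load, which is the effect the paper reports.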
Related papers
- LoGU: Long-form Generation with Uncertainty Expressions [49.76417603761989]
We introduce the task of Long-form Generation with Uncertainty (LoGU).
We identify two key challenges: Uncertainty Suppression and Uncertainty Misalignment.
Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims.
Experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.
arXiv Detail & Related papers (2024-10-18T09:15:35Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of its performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned [36.216938133315786]
We introduce a reliable and high-speed production system aimed at detecting and rectifying the hallucination issue within large language models (LLMs).
Our system encompasses named entity recognition (NER), natural language inference (NLI), span-based detection (SBD).
We detail the core elements of our framework and underscore the paramount challenges tied to response time, availability, and performance metrics.
arXiv Detail & Related papers (2024-07-22T07:48:30Z)
- Future Aware Safe Active Learning of Time Varying Systems using Gaussian Processes [8.678546901075984]
This paper introduces a safe active learning framework tailored for time-varying systems.
The proposed Time-aware Integrated Mean Squared Prediction Error (T-IMSPE) method minimizes posterior variance over current and future states.
arXiv Detail & Related papers (2024-05-17T07:09:52Z)
- Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization [20.631476379056892]
Large Language Models (LLMs) are at the forefront of this movement.
LLMs require cloud hosting, which raises issues regarding privacy, latency, and usage limitations.
We present an edge intelligence optimization problem tailored for LLM inference.
arXiv Detail & Related papers (2024-05-12T02:38:58Z)
- Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples [63.9198662100875]
In this paper, we aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation.
We find that high energy-latency cost can be manipulated by maximizing the length of generated sequences.
Experiments demonstrate that our verbose samples can largely extend the length of generated sequences.
arXiv Detail & Related papers (2024-04-25T12:11:38Z)
- Forecasting Long-Time Dynamics in Quantum Many-Body Systems by Dynamic Mode Decomposition [6.381013699474244]
We propose a method that utilizes reliable short-time data of physical quantities to accurately forecast long-time behavior.
The method is based on the dynamic mode decomposition (DMD), which is commonly used in fluid dynamics.
It is demonstrated that the present method enables accurate forecasts at times nearly an order of magnitude longer than the span of the short-time training data.
arXiv Detail & Related papers (2024-03-29T03:10:34Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Thrust: Adaptively Propels Large Language Models with External Knowledge [58.72867916604562]
Large-scale pre-trained language models (PTLMs) are shown to encode rich knowledge in their model parameters.
The inherent knowledge in PTLMs can be opaque or static, making external knowledge necessary.
We propose the instance-level adaptive propulsion of external knowledge (IAPEK), where we only conduct the retrieval when necessary.
arXiv Detail & Related papers (2023-07-19T20:16:46Z)
- Effective Multi-User Delay-Constrained Scheduling with Deep Recurrent Reinforcement Learning [28.35473469490186]
Multi-user delay-constrained scheduling is important in many real-world applications including wireless communication, live streaming, and cloud computing.
We propose a deep reinforcement learning (DRL) algorithm, named Recurrent Softmax Delayed Deep Double Deterministic Policy Gradient ($\mathtt{RSD4}$).
$\mathtt{RSD4}$ guarantees resource and delay constraints by Lagrangian dual and delay-sensitive queues, respectively.
It also efficiently tackles partial observability with a memory mechanism enabled by the recurrent neural network (RNN) and introduces user-level decomposition and node-level
arXiv Detail & Related papers (2022-08-30T08:44:15Z)