LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
- URL: http://arxiv.org/abs/2408.05499v1
- Date: Sat, 10 Aug 2024 09:26:15 GMT
- Title: LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
- Authors: Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park
- Abstract summary: There is a lack of simulation infrastructure capable of accurately modeling versatile hardware-software behaviors in large language model (LLM) serving systems.
This paper aims to develop an effective simulation tool, called LLMServingSim, to support future research in LLM serving systems.
- Score: 17.00936774784349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been an extensive research effort in building efficient large language model (LLM) inference serving systems. These efforts include not only innovations in the algorithm and software domains but also the development of various hardware acceleration techniques. Nevertheless, there is a lack of simulation infrastructure capable of accurately modeling versatile hardware-software behaviors in LLM serving systems without extensively extending the simulation time. This paper aims to develop an effective simulation tool, called LLMServingSim, to support future research in LLM serving systems. In designing LLMServingSim, we focus on two limitations of existing simulators: (1) they lack consideration of the dynamic workload variations of LLM inference serving due to its autoregressive nature, and (2) they incur repetitive simulations without leveraging algorithmic redundancies in LLMs. To address these limitations, LLMServingSim simulates LLM serving at the granularity of iterations, leveraging computation redundancies across decoder blocks and reusing simulation results from previous iterations. Additionally, LLMServingSim provides a flexible framework that allows users to plug in any accelerator compiler-and-simulation stack for exploring various system designs with heterogeneous processors. Our experiments demonstrate that LLMServingSim produces simulation results closely following the performance behaviors of a real GPU-based LLM serving system with an error rate of less than 14.7%, while offering 91.5x faster simulation speed than existing accelerator simulators.
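To make the iteration-level reuse idea concrete, the sketch below illustrates it in Python: per-decoder-block simulation results are cached by iteration shape and reused both across structurally identical decoder blocks and across later iterations with the same shape. All names (IterationKey, simulate_decoder_block, simulate_iteration) and the toy latency model are hypothetical placeholders for illustration, not the actual LLMServingSim API.

```python
# Minimal sketch of iteration-granularity simulation with result reuse.
# Names and the latency model are hypothetical, not LLMServingSim's code.
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class IterationKey:
    """Shape of one serving iteration, used as a cache key."""
    phase: str                            # "prefill" or "decode"
    batch_size: int
    tokens_per_request: Tuple[int, ...]   # sequence lengths in this iteration


def simulate_decoder_block(key: IterationKey) -> float:
    """Placeholder for an expensive accelerator-simulator call, e.g. a
    user-plugged compiler-and-simulation stack. Returns latency in ms."""
    return 0.5 * key.batch_size + 0.001 * sum(key.tokens_per_request)


_block_cache: Dict[IterationKey, float] = {}


def simulate_iteration(key: IterationKey, num_decoder_blocks: int) -> float:
    """Simulate one serving iteration. Since all decoder blocks of an LLM are
    structurally identical, one block-level result is reused for every block,
    and the cache reuses results across iterations with the same shape."""
    if key not in _block_cache:
        _block_cache[key] = simulate_decoder_block(key)
    return _block_cache[key] * num_decoder_blocks


# Example: a prefill iteration followed by several decode iterations pays the
# block-level simulation cost only once per distinct iteration shape.
prefill = IterationKey("prefill", batch_size=4, tokens_per_request=(128, 256, 64, 512))
decode = IterationKey("decode", batch_size=4, tokens_per_request=(1, 1, 1, 1))
total = simulate_iteration(prefill, num_decoder_blocks=32)
total += sum(simulate_iteration(decode, num_decoder_blocks=32) for _ in range(8))
print(f"simulated end-to-end latency: {total:.2f} ms")
```

In this toy form, repeated decode iterations hit the cache instead of re-invoking the accelerator simulator, which reflects the redundancy-reuse idea described in the abstract.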
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in the computational cost of the LLM by 5.2-6.5x and in its GPU memory usage by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - Social Science Meets LLMs: How Reliable Are Large Language Models in Social Simulations? [40.00556764679785]
Large Language Models (LLMs) are increasingly employed for simulations, enabling applications in role-playing agents and Computational Social Science (CSS).
In this paper, we aim to answer: "How reliable is LLM-based simulation?"
arXiv Detail & Related papers (2024-10-30T20:09:37Z) - CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance.
In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z) - Enabling Large Language Models to Perform Power System Simulations with Previously Unseen Tools: A Case of Daline [1.4255659581428337]
This work proposes a modular framework that integrates expertise from both the power system domain and large language models.
It improves GPT-4o's simulation coding accuracy from 0% to 96.07%, also outperforming the ChatGPT-4o web interface's 33.8% accuracy.
arXiv Detail & Related papers (2024-06-25T02:05:26Z) - Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - Code Simulation Challenges for Large Language Models [6.970495767499435]
This work studies to what extent Large Language Models (LLMs) can simulate coding and algorithmic tasks.
We introduce benchmarks for straight-line programs, code that contains critical paths, and approximate and redundant instructions.
We propose a novel off-the-shelf prompting method, Chain of Simulation (CoSm), which instructs LLMs to simulate code execution line by line, following the pattern of compilers.
arXiv Detail & Related papers (2024-01-17T09:23:59Z) - In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - SimNet: Computer Architecture Simulation using Machine Learning [3.7019798164954336]
This work describes a concerted effort, where machine learning (ML) is used to accelerate discrete-event simulation.
A GPU-accelerated parallel simulator is implemented based on the proposed instruction latency predictor.
Its simulation accuracy and throughput are validated and evaluated against a state-of-the-art simulator.
arXiv Detail & Related papers (2021-05-12T17:31:52Z) - Achieving 100X faster simulations of complex biological phenomena by coupling ML to HPC ensembles [47.44377051031385]
We present DeepDriveMD, a tool for a range of prototypical ML-driven HPC simulation scenarios.
We use it to quantify improvements in the scientific performance of ML-driven ensemble-based applications.
arXiv Detail & Related papers (2021-04-10T15:52:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.