Related papers: LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure

LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure

URL: http://arxiv.org/abs/2511.07229v1
Date: Mon, 10 Nov 2025 15:47:53 GMT
Title: LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
Authors: Jaehong Cho, Hyunmin Choi, Jongse Park,
Abstract summary: This paper introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems.<n>It addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques.
Score: 4.382902234869111
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5x fewer LoC and outperforms the predecessor's hardware-simulator integration, demonstrating LLMServingSim2.0's low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.

Related papers

LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure [4.1898448424363695]
Large language model (LLM) serving infrastructures are undergoing a shift toward heterogeneous and disaggregation.<n>This paper presents LLMServingSim 2.0, a unified system-level simulator designed to make runtime-driven hardware-software interactions explicit and analyzable.
arXiv Detail & Related papers (2026-02-26T14:22:17Z)
Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing [16.063514680699576]
Multimodal large language models (MLLMs) extend visual understanding through a three-stage pipeline.<n> multimodal preprocessing-especially video decoding-often dominates Time-to-First-Token (TTFT)<n>We present FlashCodec and UnifiedServe, two complementary designs that jointly optimize the end-to-end MLLM pipeline.
arXiv Detail & Related papers (2025-12-19T13:40:13Z)
AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs [0.5863360388454261]
We introduce AIvailable, a low-cost, highly available LLM-as-a-Service (LLM) platform.<n>It uses a software-defined approach for running LLMs across heterogeneous and legacy GPU nodes.<n>It features a unified client interface that allows seamless interaction with all deployed LLMs.
arXiv Detail & Related papers (2025-11-06T14:19:57Z)
Simulating Environments with Reasoning Models for Agent Training [55.98861707136674]
Building bespoke environments for training is heavy, brittle, and limits progress.<n>We propose two frameworks: Simia-SFT and Simia-RL.<n>Simia-SFT and Simia-RL enable scalable agent training without environment engineering.
arXiv Detail & Related papers (2025-11-03T18:29:57Z)
LLM-I: LLMs are Naturally Interleaved Multimodal Creators [24.64752837827959]
LLM-Interleaved (LLM-I) is a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem.<n>Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools.<n>LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks.
arXiv Detail & Related papers (2025-09-17T02:33:29Z)
Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
Phantora: Maximizing Code Reuse in Simulation-based Machine Learning System Performance Estimation [13.326000659635378]
Phantora is a hybrid GPU cluster simulator for performance estimation of machine learning training workloads.<n>It allows direct reuse of ML framework source code in simulation, avoiding the need for reimplementation.<n>Phantora supports three state-of-the-art training frameworks out-of-the-box.
arXiv Detail & Related papers (2025-05-02T22:36:24Z)
Elastic On-Device LLM Service [11.778868057819269]
We present sys, an on-device Large Language Models service that elasticizes both the model and the dimension of a full LLM.<n>sys outperforms 7 strong baselines in (absolute) accuracy by up to 14.83% and 10.45% on average, with 1% TTFT switching overhead, on-par memory consumption and 100 offline GPU hours.
arXiv Detail & Related papers (2024-09-08T06:32:08Z)
LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale [17.00936774784349]
There is a lack of simulation infrastructure capable of accurately modeling versatile hardware-software behaviors in large language model (LLM) serving systems. This paper aims to develop an effective simulation tool, called LLMServingSim, to support future research in LLM serving systems.
arXiv Detail & Related papers (2024-08-10T09:26:15Z)
Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems. We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses these challenges of federated fine-tuning LLMs, and introduces our package FS-LLM as a main contribution. We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios. We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations. As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks. This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [54.692405042065815]
We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ protects only 1% salient weights and achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. We also implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs.
arXiv Detail & Related papers (2023-06-01T17:59:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.