A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
- URL: http://arxiv.org/abs/2505.01658v2
- Date: Thu, 08 May 2025 07:08:40 GMT
- Title: A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
- Authors: Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee
- Abstract summary: This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation.
- Score: 11.82688729820324
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open-source inference engines and examine the performance and cost policies of commercial solutions. We outline future research directions, including support for complex LLM-based services, support for diverse hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine
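As a concrete illustration of the caching optimization named in the abstract, here is a minimal sketch of key-value (KV) caching during autoregressive decoding; the function names and shapes (`toy_attention`, `d_model`) are hypothetical stand-ins, not code from any engine the survey covers:

```python
import numpy as np

def toy_attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_with_kv_cache(num_steps, d_model=16):
    # The cache grows by one (key, value) pair per generated token, so each
    # step attends in O(seq_len) instead of recomputing all pairs in O(seq_len^2).
    rng = np.random.default_rng(0)
    K_cache, V_cache, outputs = [], [], []
    for _ in range(num_steps):
        # In a real engine these come from projecting the new token's hidden state.
        q = rng.standard_normal(d_model)
        k, v = rng.standard_normal(d_model), rng.standard_normal(d_model)
        K_cache.append(k)
        V_cache.append(v)
        outputs.append(toy_attention(q, np.stack(K_cache), np.stack(V_cache)))
    return outputs

decode_with_kv_cache(4)
```

Production engines extend this idea with paged or quantized KV caches to keep memory usage manageable at long sequence lengths.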
Related papers
- Compiler Optimization via LLM Reasoning for Efficient Model Serving [7.257845254223727]
We introduce a compilation framework (dubbed REASONING COMPILER) that formulates optimization as a sequential, context-aware decision process. We achieve substantial speedups with markedly fewer samples than leading neural compilers.
arXiv Detail & Related papers (2025-06-02T07:02:46Z)
- Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey [59.52058740470727]
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems. This survey provides a structured tutorial on fundamental architectures, enabling technologies, and emerging applications.
arXiv Detail & Related papers (2025-05-03T13:55:38Z)
- A Survey on Inference Optimization Techniques for Mixture of Experts Models [50.40325411764262]
Large-scale Mixture of Experts (MoE) models offer enhanced model capacity and computational efficiency through conditional computation (a toy routing sketch follows this entry). Deploying and running inference on these models presents significant challenges in computational resources, latency, and energy efficiency. This survey analyzes optimization techniques for MoE models across the entire system stack.
arXiv Detail & Related papers (2024-12-18T14:11:15Z)
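To make the conditional computation mentioned above concrete, below is a toy sketch of top-k expert routing in an MoE layer; the gating scheme and shapes are illustrative assumptions, not the specific designs analyzed in that survey:

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, k=2):
    # x: (d_in,) token activation; expert_weights: (n_experts, d_in, d_out).
    logits = gate_weights @ x                # gate scores, shape (n_experts,)
    top_k = np.argsort(logits)[-k:]          # pick the k highest-scoring experts
    probs = np.exp(logits[top_k] - logits[top_k].max())
    probs /= probs.sum()                     # softmax over the selected experts
    # Only the chosen experts run: this is the conditional computation
    # that keeps per-token FLOPs far below a dense layer of equal capacity.
    return sum(p * (x @ expert_weights[i]) for p, i in zip(probs, top_k))

rng = np.random.default_rng(0)
out = moe_layer(rng.standard_normal(8),
                rng.standard_normal((4, 8, 4)),   # 4 experts, d_in=8, d_out=4
                rng.standard_normal((4, 8)))
```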
- A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and ability to perform various language tasks with minimal computational resources.
We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques (a toy quantization sketch follows this entry).
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
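Because model compression recurs across these surveys, here is a minimal sketch of symmetric per-tensor int8 weight quantization, one of the simplest compression techniques; it is a generic textbook variant, not the specific method of any paper listed here:

```python
import numpy as np

def quantize_int8(w):
    # Map floats in [-max|w|, max|w|] onto the int8 range [-127, 127],
    # keeping a single float scale per tensor for dequantization.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights (4x smaller storage).
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```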
- What Makes Large Language Models Reason in (Multi-Turn) Code Generation? [28.614888506962988]
Chain-of-thought has established itself as a popular vehicle for improving the outputs of large language models (LLMs). We investigate the effects of a wide range of prompting strategies, with a focus on automatic re-prompting over multiple turns and computational requirements. Our study reveals strategies that consistently improve performance across all models with small and large sampling budgets.
arXiv Detail & Related papers (2024-10-10T16:53:10Z)
- On-Device Language Models: A Comprehensive Review [26.759861320845467]
This review examines the challenges of deploying computationally expensive large language models on resource-constrained devices.
The paper investigates on-device language models, their efficient architectures, and state-of-the-art compression techniques.
Case studies of on-device language models from major mobile manufacturers demonstrate real-world applications and potential benefits.
arXiv Detail & Related papers (2024-08-26T03:33:36Z)
- Large Language Model as a Catalyst: A Paradigm Shift in Base Station Siting Optimization [62.16747639440893]
Large language models (LLMs) and their associated technologies are advancing, particularly in the realms of prompt engineering and agent engineering. Our proposed framework incorporates retrieval-augmented generation (RAG) to enhance the system's ability to acquire domain-specific knowledge and generate solutions (a minimal RAG sketch follows this entry).
arXiv Detail & Related papers (2024-08-07T08:43:32Z)
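For readers unfamiliar with RAG, the entry above builds on the retrieve-then-generate pattern sketched below; the corpus, embedding function, and the final LLM call are placeholders, not the framework that paper proposes:

```python
import numpy as np

# Hypothetical document store: (embedding, text) pairs.
CORPUS = [
    (np.array([0.9, 0.1]), "Macro cells cover wide areas at lower frequencies."),
    (np.array([0.1, 0.9]), "Small cells densify coverage in urban hotspots."),
]

def embed(query: str) -> np.ndarray:
    # Placeholder embedding; a real system would use a trained encoder.
    return np.array([1.0, 0.0]) if "coverage" in query else np.array([0.0, 1.0])

def retrieve(query: str, top_k: int = 1) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    cos = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    ranked = sorted(CORPUS, key=lambda doc: -cos(doc[0], q))
    return [text for _, text in ranked[:top_k]]

def build_prompt(query: str) -> str:
    # Retrieved context is prepended to the prompt; the result would then be
    # sent to any chat-completion API (stand-in for the generation step).
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How can we improve coverage?"))
```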
- Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z)
- ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling [15.67321902882617]
We propose a viable path for training open-source LLMs capable of optimization modeling and developing solver codes. This work also introduces IndustryOR, the first industrial benchmark for evaluating LLMs in solving practical OR problems.
arXiv Detail & Related papers (2024-05-28T01:55:35Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- Large Language Models for Supply Chain Optimization [4.554094815136834]
We study how Large Language Models (LLMs) can help bridge the gap between supply chain automation and human comprehension and trust thereof.
We design OptiGuide -- a framework that accepts as input queries in plain text, and outputs insights about the underlying optimization outcomes.
We demonstrate the effectiveness of our framework on a real server placement scenario within Microsoft's cloud supply chain.
arXiv Detail & Related papers (2023-07-08T01:42:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.