Fugu-MT paper translation (abstract): Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture

Paper summary: Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture

arxiv url: http://arxiv.org/abs/2508.17814v1
Date: Mon, 25 Aug 2025 09:11:27 GMT
Status: Translation complete
Last updated in system: 2025-08-26 18:43:45.702476
Title: Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture
Title (reference translation): An extensible engine and the performance of different LLM models in a SLURM-based HPC architecture
Authors: Anderson de Lima Luiz, Shubham Vijay Kurlekar, Munir Georges
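Abstract summary: This work details a high-performance computing architecture based on SLURM (Simple Linux Utility for Resource Management). Dynamic resource scheduling and seamless integration of containerization are leveraged to manage CPU, GPU, and memory efficiently in multi-node clusters. The results pave the way for more efficient, responsive, and fault-tolerant LLM inference on large-scale HPC infrastructures.
Reference score (independently computed attention level): 3.746889836344766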
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work elaborates on a high-performance computing (HPC) architecture based on Simple Linux Utility for Resource Management (SLURM) [1] for deploying heterogeneous Large Language Models (LLMs) into a scalable inference engine. Dynamic resource scheduling and seamless integration of containerized microservices have been leveraged herein to manage CPU, GPU, and memory allocations efficiently in multi-node clusters. Extensive experiments, using Llama 3.2 (1B and 3B parameters) [2] and Llama 3.1 (8B and 70B) [3], probe throughput, latency, and concurrency and show that small models can handle up to 128 concurrent requests at sub-50 ms latency, while for larger models, saturation happens with as few as two concurrent users, with a latency of more than 2 seconds. This architecture includes Representational State Transfer Application Programming Interface (REST API) [4] endpoints for single and bulk inferences, as well as advanced workflows such as multi-step "tribunal" refinement. Experimental results confirm minimal overhead from container and scheduling activities and show that the approach scales reliably for both batch and interactive settings. We further illustrate real-world scenarios, including the deployment of chatbots with retrieval-augmented generation, which helps to demonstrate the flexibility and robustness of the architecture. The obtained results pave the way for significantly more efficient, responsive, and fault-tolerant LLM inference on large-scale HPC infrastructures.
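Abstract (reference translation): This work details a high-performance computing (HPC) architecture based on SLURM (Simple Linux Utility for Resource Management) [1] for deploying heterogeneous large language models (LLMs) into a scalable inference engine. Dynamic resource scheduling and seamless integration of containerized microservices are used to manage CPU, GPU, and memory allocations efficiently in multi-node clusters. Experiments with Llama 3.2 (1B and 3B parameters) [2] and Llama 3.1 (8B and 70B) [3] probe throughput, latency, and concurrency; small models can handle up to 128 concurrent requests at latencies below 50 ms, while larger models saturate with as few as two concurrent users, at latencies above 2 seconds. The architecture includes Representational State Transfer Application Programming Interface (REST API) [4] endpoints for single and bulk inference, as well as advanced workflows such as multi-step "tribunal" refinement. The experimental results confirm that container and scheduling activities add minimal overhead and show that the approach scales reliably for both batch and interactive settings. Real-world scenarios, including the deployment of chatbots with retrieval-augmented generation, are also described and help demonstrate the flexibility and robustness of the architecture. The results pave the way for more efficient, responsive, and fault-tolerant LLM inference on large-scale HPC infrastructures.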
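
The abstract reports latency and throughput at different concurrency levels, with saturation once concurrent requests exceed what the serving backend can process in parallel. The following is a minimal sketch of that kind of probe, not the authors' benchmark: `fake_infer`, `capacity`, and `service_time` are hypothetical stand-ins for a real inference endpoint, its number of parallel slots, and its per-request compute time.

```python
import asyncio
import time


async def fake_infer(slots: asyncio.Semaphore, service_time: float) -> None:
    """Hypothetical stand-in for one inference call: wait for a free slot, then 'compute'."""
    async with slots:
        await asyncio.sleep(service_time)


async def timed_request(slots: asyncio.Semaphore, service_time: float) -> float:
    """Return the end-to-end latency of one request, including time spent queued."""
    start = time.perf_counter()
    await fake_infer(slots, service_time)
    return time.perf_counter() - start


async def probe(concurrency: int, capacity: int, service_time: float) -> dict:
    """Fire `concurrency` requests at once against a backend with `capacity` parallel slots."""
    slots = asyncio.Semaphore(capacity)
    start = time.perf_counter()
    latencies = await asyncio.gather(
        *(timed_request(slots, service_time) for _ in range(concurrency))
    )
    wall = time.perf_counter() - start
    return {
        "concurrency": concurrency,
        "mean_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": max(latencies),
        "throughput_rps": concurrency / wall,
    }


def run_probe(levels, capacity: int, service_time: float) -> list:
    """Run the probe once per concurrency level and collect the result rows."""
    async def _run():
        return [await probe(c, capacity, service_time) for c in levels]
    return asyncio.run(_run())


if __name__ == "__main__":
    # Assumed values for illustration only: 2 parallel slots, 50 ms per request.
    for row in run_probe([1, 2, 4, 8], capacity=2, service_time=0.05):
        print(row)
```

In this toy model, latency stays near `service_time` while concurrency is at or below `capacity` and grows once requests start to queue, which is the qualitative saturation pattern the abstract describes for the larger models. Replacing `fake_infer` with a real HTTP call would be needed to measure an actual deployment.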

Related papers