Fugu-MT Paper Translation (Abstract): BatchGen: An Architecture for Scalable and Efficient Batch Inference

Paper overview: BatchGen: An Architecture for Scalable and Efficient Batch Inference

arxiv url: http://arxiv.org/abs/2606.21712v1
Date: Fri, 19 Jun 2026 19:56:21 GMT
Status: Translation complete
Last updated in system: 2026-06-26 03:51:13.585928
Title: BatchGen: An Architecture for Scalable and Efficient Batch Inference
Title (reference translation): BatchGen: An Architecture for Scalable and Efficient Batch Inference
Authors: Tairan Xu, Leyang Xue, Zhan Lu, Jinfu Deng, Hongyang Xiao, Yinsicheng Jiang, Congjie He, Matej Sandor, Le Xu, Luo Mai
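Abstract summary: Batch inference has become a central mode of AI computation, yet existing inference engines still rely on execution models designed for interactive serving. The paper introduces a new architectural foundation for batch inference, the sequence coroutine compute model, which represents each sequence as a fine-grained, event-driven coroutine. This model exposes expressive primitives that let the runtime reorganize work dynamically, enabling larger expert-level batches, mitigating stragglers, reallocating work across devices, and maintaining utilization even on cost-effective or memory-constrained GPUs.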
Reference score (independently computed attention measure): 7.794394498151309
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Batch inference has become a central mode of AI computation, yet existing inference engines still rely on execution models designed for interactive serving. When scaled to millions of sequences, batch workloads reveal two fundamental requirements: the ability to handle extreme inter- and intra-sequence load variation that emerges only at runtime, and the ability to sustain high utilization across large fleets of GPUs. Existing systems fail to meet these requirements, losing substantial fractions of achievable throughput. We introduce a new architectural foundation for batch inference: the sequence coroutine compute model, which represents each sequence as a fine-grained, event-driven coroutine. This model exposes expressive primitives that allow the runtime to reorganize work dynamically, enabling larger expert-level batches, mitigating stragglers, reallocating work across devices, and maintaining utilization even on cost-effective or memory-constrained GPUs. Building on this abstraction, we implement BatchGen, a production-ready system that uses the coroutine model at cluster scale. On a 128-GPU cluster, BatchGen reduces batch completion time by up to $2.3\times$, and on memory-constrained accelerators it outperforms the strongest offloading baseline by up to $9.6\times$. We will open-source BatchGen at https://github.com/batchgen-project/batchgen
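Abstract (reference translation): Batch inference has become a central mode of AI computation, yet existing inference engines still rely on execution models designed for interactive serving. When scaled to millions of sequences, batch workloads reveal two fundamental requirements: handling extreme inter- and intra-sequence load variation that emerges only at runtime, and sustaining high utilization across large fleets of GPUs. Existing systems cannot meet these requirements and lose a substantial fraction of achievable throughput. We introduce a new architectural foundation for batch inference, the sequence coroutine compute model, which represents each sequence as a fine-grained, event-driven coroutine. This model exposes expressive primitives that let the runtime reorganize work dynamically, enabling larger expert-level batches, mitigating stragglers, reallocating work across devices, and maintaining utilization even on cost-effective or memory-constrained GPUs. Building on this abstraction, we implement BatchGen, a production-ready system that uses the coroutine model at cluster scale. On a 128-GPU cluster, BatchGen reduces batch completion time by up to $2.3\times$, and on memory-constrained accelerators it outperforms the strongest offloading baseline by up to $9.6\times$. We will open-source BatchGen at https://github.com/batchgen-project/batchgen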
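
The abstract describes each sequence as an event-driven coroutine that the runtime can suspend and regroup. The toy sketch below illustrates that general idea with plain Python generators; it is not BatchGen's API or scheduling policy. The names `sequence`, `run_batch`, the toy experts `double` and `inc`, and the "largest waiting group first" rule are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Generator, List, Tuple

# Hypothetical event: "run this value through the named expert".
Request = Tuple[str, int]


def sequence(seq_id: int, route: List[str]) -> Generator[Request, int, List[int]]:
    """One sequence as a coroutine: yield a request, resume with its result."""
    value = seq_id
    outputs: List[int] = []
    for expert in route:
        value = yield (expert, value)  # suspend until the runtime runs the expert
        outputs.append(value)
    return outputs


def run_batch(routes: Dict[int, List[str]]):
    """Toy runtime: groups suspended sequences by the expert they wait on."""
    experts = {"double": lambda x: 2 * x, "inc": lambda x: x + 1}  # stand-ins
    coros = {sid: sequence(sid, route) for sid, route in routes.items()}
    pending = defaultdict(list)  # expert name -> [(seq_id, value), ...]
    results: Dict[int, List[int]] = {}
    batch_sizes: List[Tuple[str, int]] = []

    def advance(sid: int, step) -> None:
        """Resume one coroutine; record its next request or its final result."""
        try:
            expert, value = step()
            pending[expert].append((sid, value))
        except StopIteration as stop:
            results[sid] = stop.value

    for sid, co in coros.items():
        advance(sid, lambda co=co: next(co))  # run each sequence to its first yield

    while pending:
        # Assumed policy: serve the expert with the most waiting sequences.
        expert = max(pending, key=lambda name: len(pending[name]))
        batch = pending.pop(expert)
        batch_sizes.append((expert, len(batch)))
        for sid, value in batch:
            out = experts[expert](value)
            advance(sid, lambda co=coros[sid], out=out: co.send(out))
    return results, batch_sizes
```

Calling `run_batch({0: ["double", "inc"], 1: ["double"], 2: ["inc", "double"]})` returns the per-sequence outputs together with the list of expert batches that were formed. Because every sequence suspends at each expert boundary, the runtime is free to regroup waiting sequences between steps, which is the property the abstract attributes to the coroutine model.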

Related papers