Fugu-MT Paper Translation (Abstract): SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data

Paper abstract: SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data

arxiv url: http://arxiv.org/abs/2605.01060v1
Date: Fri, 01 May 2026 19:51:50 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.560915
Title: SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data
Title (reference translation): SURGE: Super-batch unified resource-efficient GPU encoding for heterogeneous partitioned data
Authors: Shashank Kapadia, Deep Narayan Mishra, Sujal Reddy Alugubelli, Ajay Kumar, Swapnil Yadav, Rishi Bhatia
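Abstract summary: SURGE is a streaming encoding system deployed in production to generate embeddings for over 800 million texts across 40,000 logical partitions. On 10M texts with 4 NVIDIA L4 GPUs, SURGE delivers 26,413 texts/s.
Reference score (independently computed attention score): 3.1624024957575982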
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present SURGE, a streaming GPU encoding system deployed in production to generate embeddings for over 800 million texts across 40,000 logical partitions. Production embedding pipelines face a tension between logical data partitioning and efficient GPU utilization: processing each partition independently incurs $P$ inter-process communication (IPC) calls whose overhead limits throughput for compute-light models. Our contributions are analytical: (i) a cost model (Theorem 1) predicting throughput within 2% across three encoders spanning a 15$\times$ parameter range; (ii) a memory-safety bound (Lemma 3) enabling a streaming two-threshold policy with peak memory $O(B_{\min} + n_{\max})$ rather than $O(N)$; and (iii) a $φ$/CV decision framework characterizing when the pattern applies beyond our workload. The naive fix of batching at fixed size requires $O(N)$ peak memory (32.7 GB at 10M texts; infeasible beyond ~60M on 192 GB nodes), produces no output until all encoding completes, and offers no fault tolerance. SURGE achieves the same throughput with $O(B_{\min} + n_{\max})$ bounded memory (2.6 GB), 68$\times$ faster time-to-first-output, and crash recovery at SuperBatch granularity. On 10M texts with 4 NVIDIA L4 GPUs, SURGE delivers 26,413 texts/s -- matching fixed-batch throughput while using 12.6$\times$ less memory. We validate on bge-base (109M, $d$=768, error 1.3%) and across log-normal $σ$ in {1.0, 1.72, 2.5} (speedup invariant within $\pm$3%), and compare against a partition-batched baseline (PB-PBP-LB), against which SURGE retains a 7% throughput edge and 2.5$\times$ faster TTFO. Complementary engineering -- zero-copy Arrow serialization (22-25$\times$ speedup) and async I/O pipelining (up to 93% benefit) -- realizes the design but is not the contribution.
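Abstract (reference translation): SURGE is a streaming GPU encoding system deployed in production to generate embeddings for over 800 million texts across 40,000 logical partitions. Production embedding pipelines face a tension between logical data partitioning and efficient GPU utilization: processing each partition independently incurs $P$ inter-process communication (IPC) calls, whose overhead limits throughput for compute-light models. Our contributions are analytical: (i) a cost model (Theorem 1) that predicts throughput across three encoders spanning a 15$\times$ parameter range; (ii) a memory-safety bound (Lemma 3) that enables streaming with peak memory $O(B_{\min} + n_{\max})$ rather than $O(N)$; and (iii) a $φ$/CV decision framework characterizing when the pattern applies beyond our workload. Batching at a fixed size requires $O(N)$ peak memory (32.7 GB at 10M texts; infeasible beyond about 60M on 192 GB nodes), produces no output until all encoding completes, and offers no fault tolerance. SURGE achieves the same throughput with $O(B_{\min} + n_{\max})$ bounded memory (2.6 GB), 68$\times$ faster time-to-first-output, and crash recovery at SuperBatch granularity. On 10M texts with 4 NVIDIA L4 GPUs, SURGE delivers 26,413 texts/s. We validate on bge-base (109M, $d$=768, error 1.3%) and across log-normal $σ$ in {1.0, 1.72, 2.5} (speedup invariant within $\pm$3%), and compare against a partition-batched baseline (PB-PBP-LB), against which SURGE retains a 7% throughput edge and 2.5$\times$ faster TTFO. Complementary engineering -- zero-copy Arrow serialization (22-25$\times$ speedup) and async I/O pipelining (up to 93% benefit) -- realizes the design but is not the contribution.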
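
Illustrative sketch (not from the paper): the abstract gives a peak-memory bound of $O(B_{\min} + n_{\max})$ but does not describe the two-threshold policy itself. The Python below shows only a simpler, single-threshold accumulate-and-flush loop, as an assumption, to illustrate why a bound of that shape can arise when whole partitions are grouped before being sent to the GPU. The names `superbatches`, `b_min`, and `Partition` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical representation: (partition id, texts in that partition).
Partition = Tuple[str, List[str]]


def superbatches(
    partitions: Iterable[Partition], b_min: int
) -> Iterator[List[Partition]]:
    """Group whole partitions into batches of at least b_min texts.

    Assumption: a single flush threshold. Before each append the buffer
    holds fewer than b_min texts, so after appending a partition of at
    most n_max texts it holds fewer than b_min + n_max texts.
    """
    buffer: List[Partition] = []
    buffered = 0  # number of texts currently held in the buffer
    for pid, texts in partitions:
        buffer.append((pid, texts))
        buffered += len(texts)
        if buffered >= b_min:
            # One downstream call per batch instead of one per partition.
            yield buffer
            buffer, buffered = [], 0
    if buffer:
        # Final batch may be smaller than b_min.
        yield buffer
```

Because the generator yields each batch as soon as the threshold is reached, a consumer can encode and write output while later partitions are still being read, and partition boundaries are preserved inside every batch.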
