Fugu-MT Paper Translation (Abstract): Sparton: Fast and Memory-Efficient Triton Kernel for Learned Sparse Retrieval

Paper abstract: Sparton: Fast and Memory-Efficient Triton Kernel for Learned Sparse Retrieval

arxiv url: http://arxiv.org/abs/2603.25011v1
Date: Thu, 26 Mar 2026 04:20:24 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.095312
Title: Sparton: Fast and Memory-Efficient Triton Kernel for Learned Sparse Retrieval
Authors: Thong Nguyen, Cosimo Rulli, Franco Maria Nardini, Rossano Venturini, Andrew Yates
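Abstract summary: State-of-the-art Learned Sparse Retrieval (LSR) models such as Splade use a Language Modeling (LM) head to project latent hidden states into a lexically-anchored logit matrix. Despite its effectiveness, the LM head creates a massive memory bottleneck due to the sheer size of the vocabulary (V). We propose Sparton, a fast, memory-efficient Triton kernel tailored for the LM head in LSR models.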
Reference score (independently computed attention score): 21.607735361193622
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State-of-the-art Learned Sparse Retrieval (LSR) models, such as Splade, typically employ a Language Modeling (LM) head to project latent hidden states into a lexically-anchored logit matrix. This intermediate matrix is subsequently transformed into a sparse lexical representation through element-wise operations (ReLU, Log1P) and max-pooling over the sequence dimension. Despite its effectiveness, the LM head creates a massive memory bottleneck due to the sheer size of the vocabulary (V), which can range from 30,000 to over 250,000 tokens in recent models. Materializing this matrix creates a significant memory bottleneck, limiting model scaling. The resulting I/O overhead between operators further throttles throughput and runtime performance. In this paper, we propose Sparton, a fast memory-efficient Triton kernel tailored for the LM head in LSR models. Sparton utilizes a fused approach that integrates the tiled matrix multiplication, ReLU, Log1P, and max-reduction into a single GPU kernel. By performing an early online reduction directly on raw logit tiles, Sparton avoids materializing the full logit matrix in memory. Our experiments demonstrate that the Sparton kernel, in isolation, achieves up to a 4.8x speedup and an order-of-magnitude reduction in peak memory usage compared to PyTorch baselines. Integrated into Splade (|V| ~ 30k), Sparton enables a 33% larger batch size and 14% faster training with no effectiveness loss. On a multilingual backbone (|V| ~ 250k), these gains jump to a 26x larger batch size and 2.5x faster training.
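The fused, tile-wise reduction described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' Triton kernel: the function names `baseline` and `fused_tiled`, the tile size `tile_v`, and the nested-list layout are hypothetical choices made for illustration, and attention masking and the backward pass are omitted. The sketch assumes that the "early online reduction on raw logit tiles" can rely on ReLU and Log1P being non-decreasing, so the maximum over the sequence can be taken on raw logits and the element-wise functions applied once per vocabulary entry afterwards.

```python
import math


def baseline(hidden, weight):
    """Reference path: materialize the full (seq_len x vocab) logit matrix."""
    # hidden: seq_len rows of size d; weight: vocab rows of size d.
    logits = [[sum(h * w for h, w in zip(row, col)) for col in weight]
              for row in hidden]
    # Element-wise ReLU then Log1P on every logit.
    acts = [[math.log1p(max(x, 0.0)) for x in row] for row in logits]
    # Max-pool over the sequence dimension, one value per vocabulary entry.
    return [max(col) for col in zip(*acts)]


def fused_tiled(hidden, weight, tile_v=3):
    """Fused path: keep only a running max per vocabulary tile.

    Assumption: log1p(relu(x)) is non-decreasing in x, so
    max_l log1p(relu(x_l)) equals log1p(relu(max_l x_l)).
    """
    vocab = len(weight)
    out = [0.0] * vocab
    for v0 in range(0, vocab, tile_v):
        v_tile = weight[v0:v0 + tile_v]
        # One running maximum per vocabulary entry in the current tile.
        running_max = [-math.inf] * len(v_tile)
        for row in hidden:
            for j, w in enumerate(v_tile):
                logit = sum(h * x for h, x in zip(row, w))
                if logit > running_max[j]:
                    running_max[j] = logit
        # Apply ReLU and Log1P once per vocabulary entry, after the reduction.
        for j, m in enumerate(running_max):
            out[v0 + j] = math.log1p(max(m, 0.0))
    return out
```

In this sketch `baseline` holds seq_len x vocab intermediate values at once, while `fused_tiled` holds at most `tile_v` running maxima plus the final vocab-sized output, which mirrors the memory argument made in the abstract.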


Related papers