Fugu-MT Paper Translation (Abstract): Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

Paper summary: Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

arxiv url: http://arxiv.org/abs/2601.17136v1
Date: Fri, 23 Jan 2026 19:25:10 GMT
Status: Translation complete
System update date: 2026-01-27 15:23:07.302203
Title: Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs
Title (reference translation): Communication Avoidance for Linear Algebraic Kernel K-Means on GPUs
Authors: Julian Bellavita, Matthew Rubino, Nakul Iyer, Andrew Chang, Aditya Devarakonda, Flavio Vella, Giulia Guidi
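Abstract summary: We present a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering. Our approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives. Our 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical.
Reference score (independently computed attention metric): 1.0017970035130424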
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work has accelerated Kernel K-means by formulating it using sparse linear algebra primitives and implementing it on a single GPU. However, that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory. In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems. Our approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives uniquely tailored for Kernel K-means, enabling highly scalable implementations that efficiently cluster million-scale datasets. Central to our work is the design of partitioning schemes that enable communication-efficient composition of the linear algebra primitives that appear in Kernel K-means. Our 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical. On 256 GPUs, it achieves a geometric mean weak scaling efficiency of $79.7\%$ and a geometric mean strong scaling speedup of $4.2\times$. Compared to our 1D algorithm, the 1.5D approach achieves up to a $3.6\times$ speedup on 256 GPUs and reduces clustering time from over an hour to under two seconds relative to a single-GPU sliding window implementation. Our results show that distributed algorithms designed with application-specific linear algebraic formulations can achieve substantial performance improvement.
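Abstract (reference translation): Clustering is an important tool in data analysis, and K-means is popular for its simplicity and versatility. However, it cannot handle clusters that are not linearly separable. Kernel K-means addresses this limitation, but it requires a large kernel matrix, which makes it compute- and memory-intensive. Prior work accelerated Kernel K-means by formulating it with sparse linear algebra primitives and implementing it on a single GPU. Because GPU memory is limited, however, that approach cannot run on datasets with more than approximately 80,000 samples. This work addresses the issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems. The approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives tailored specifically to Kernel K-means, yielding highly scalable implementations that efficiently cluster million-scale datasets. Central to the work is the design of partitioning schemes that allow the linear algebra primitives appearing in Kernel K-means to be composed in a communication-efficient way. The 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical. On 256 GPUs, it achieves a geometric mean weak scaling efficiency of $79.7\%$ and a geometric mean strong scaling speedup of $4.2\times$. Compared to the 1D algorithm, the 1.5D approach achieves up to a $3.6\times$ speedup on 256 GPUs, and relative to a single-GPU sliding window implementation it reduces clustering time from over an hour to under two seconds. The results show that distributed algorithms designed with application-specific linear algebraic formulations can achieve substantial performance improvement.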
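
For readers unfamiliar with the linear algebraic view of Kernel K-means mentioned in the abstract, the following is a minimal single-node NumPy sketch, not the paper's distributed GPU implementation. It uses the standard identity for the squared feature-space distance to a cluster centroid, written as one matrix product with a scaled cluster-membership matrix. The function names (`rbf_kernel`, `kernel_distances`, `kernel_kmeans`, `kernel_matrix_gib`), the RBF kernel, the dense float32 storage assumption, and the toy data are all illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Dense RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negative round-off

def kernel_distances(K, labels, k):
    """Squared feature-space distance from every point to every cluster centroid.

    D[i, c] = K[i, i] - 2 * mean_{j in c} K[i, j] + mean_{j, l in c} K[j, l]
    """
    n = K.shape[0]
    # V is the k x n membership matrix, each row scaled by 1 / |cluster|.
    V = np.zeros((k, n))
    V[labels, np.arange(n)] = 1.0
    sizes = V.sum(axis=1)
    empty = sizes == 0
    V[~empty] /= sizes[~empty][:, None]
    E = K @ V.T                    # n x k: mean kernel value to each cluster
    z = np.sum(V * E.T, axis=1)    # length k: diag(V @ E), centroid norms
    D = np.diag(K)[:, None] - 2.0 * E + z[None, :]
    D[:, empty] = np.inf           # never assign points to an empty cluster
    return D

def kernel_kmeans(K, k, init_labels, n_iter=100):
    """Lloyd-style iteration on a precomputed kernel matrix (hypothetical helper)."""
    labels = np.asarray(init_labels)
    for _ in range(n_iter):
        new_labels = np.argmin(kernel_distances(K, labels, k), axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def kernel_matrix_gib(n, bytes_per_entry=4):
    """Size of a dense n x n kernel matrix in GiB (assumes float32 by default)."""
    return n * n * bytes_per_entry / 2**30

# Toy data: two tight groups of four points each, far apart.
A = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
X = np.vstack([A, A + 10.0])
init = np.array([0, 0, 0, 1, 0, 1, 1, 1])   # deliberately imperfect start
labels = kernel_kmeans(rbf_kernel(X, gamma=1.0), k=2, init_labels=init)
print(labels)
print(round(kernel_matrix_gib(80_000), 2), "GiB for a dense float32 kernel matrix")
```

The only operation that touches the full kernel matrix is the product `K @ V.T`, which is why the cost and memory footprint of the method are dominated by the n x n matrix. The last line gives a rough sense of scale under the dense float32 assumption; the abstract itself does not state which storage format underlies the 80,000-sample limit.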

Related papers