Fugu-MT Paper Translation (Abstract): Generalization Performance of Ensemble Clustering: From Theory to Algorithm

Paper overview: Generalization Performance of Ensemble Clustering: From Theory to Algorithm

arxiv url: http://arxiv.org/abs/2506.02053v1
Date: Sun, 01 Jun 2025 09:34:52 GMT
Status: Translation complete
Last updated in system: 2025-06-04 21:47:34.880155
Title: Generalization Performance of Ensemble Clustering: From Theory to Algorithm
Authors: Xu Zhang, Haoye Qiu, Weixuan Liang, Hui Liu, Junhui Hou, Yuheng Jia
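Abstract summary: This paper focuses on the generalization error, excess risk, and consistency of ensemble clustering. By assigning varying weights to a finite set of base clusterings, it minimizes the error between the empirical average clustering and its expectation. The theory is then instantiated to develop a new ensemble clustering algorithm.
Reference score (independently computed attention score): 57.176040163699554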
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Ensemble clustering has demonstrated great success in practice; however, its theoretical foundations remain underexplored. This paper examines the generalization performance of ensemble clustering, focusing on generalization error, excess risk, and consistency. We derive a generalization error bound and an excess risk bound, both with convergence rate $\mathcal{O}(\sqrt{\frac{\log n}{m}}+\frac{1}{\sqrt{n}})$, where $n$ and $m$ are the numbers of samples and base clusterings. Based on this, we prove that ensemble clustering is consistent when $m$ and $n$ approach infinity and $m$ is significantly larger than $\log n$, i.e., $m,n\to \infty, m\gg \log n$. Furthermore, because $n$ and $m$ are finite in practice, the generalization error cannot be reduced to zero. Thus, by assigning varying weights to the finite set of base clusterings, we minimize the error between the empirical average clustering and its expectation. From this, we theoretically demonstrate that to achieve better clustering performance, we should minimize the deviation (bias) of each base clustering from its expectation and maximize the differences (diversity) among the base clusterings. Additionally, we show that maximizing diversity is nearly equivalent to a robust (min-max) optimization model. Finally, we instantiate our theory to develop a new ensemble clustering algorithm. Compared with state-of-the-art (SOTA) methods, our approach achieves average improvements of 6.1%, 7.3%, and 6.0% on 10 datasets w.r.t. NMI, ARI, and Purity. The code is available at https://github.com/xuz2019/GPEC.
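To make the stated rate concrete, the following minimal Python sketch evaluates the two terms $\sqrt{\log n / m}$ and $1/\sqrt{n}$ for given $n$ and $m$. The constant hidden by the $\mathcal{O}(\cdot)$ notation is not given in the abstract, so the values only indicate how the terms scale, not the actual error. The function names `rate_terms` and `rate` are illustrative and do not come from the paper or its repository.

```python
import math


def rate_terms(n, m):
    """Return (sqrt(log n / m), 1 / sqrt(n)), the two terms of the stated rate.

    n: number of samples, m: number of base clusterings.
    The O(.) constant is unknown, so these are scaling terms only.
    """
    if n < 2 or m < 1:
        raise ValueError("need n >= 2 and m >= 1")
    return math.sqrt(math.log(n) / m), 1.0 / math.sqrt(n)


def rate(n, m):
    """Sum of the two terms, i.e. the rate up to a constant factor."""
    ensemble_term, sample_term = rate_terms(n, m)
    return ensemble_term + sample_term


if __name__ == "__main__":
    # Hypothetical (n, m) pairs chosen only for illustration.
    for n, m in [(10**3, 10), (10**3, 1000), (10**6, 10), (10**6, 1000)]:
        e, s = rate_terms(n, m)
        print(f"n={n:>8} m={m:>5}  sqrt(log n/m)={e:.4f}  1/sqrt(n)={s:.4f}")
```

The first term shrinks only when $m$ grows faster than $\log n$, which is the condition $m \gg \log n$ in the consistency statement: if $m$ is kept proportional to $\log n$, that term stays near a constant no matter how large $n$ becomes.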
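The idea of assigning weights to a finite set of base clusterings can be sketched with a weighted co-association matrix, a common construction in ensemble clustering. The sketch below is not the authors' GPEC algorithm: the abstract does not specify how the weights are optimized, and the paper's bias is defined relative to a population expectation, for which the empirical mean is used here as a stand-in. All function names are hypothetical.

```python
import numpy as np


def co_association(labels):
    """n x n 0/1 matrix: entry (i, j) is 1 when samples i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def weighted_consensus(base_labels, weights=None):
    """Weighted average of the co-association matrices of m base clusterings.

    weights=None gives the uniform (empirical) average; otherwise the weights
    are normalized to sum to 1 before averaging.
    """
    mats = np.stack([co_association(lab) for lab in base_labels])
    m = mats.shape[0]
    if weights is None:
        w = np.full(m, 1.0 / m)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    # Contract the weight vector with the first axis of the (m, n, n) stack.
    return np.tensordot(w, mats, axes=1)


def deviation_from_mean(base_labels):
    """Assumed bias proxy: mean squared deviation of each base co-association
    matrix from the uniform empirical average (not the paper's definition)."""
    mats = np.stack([co_association(lab) for lab in base_labels])
    mean = mats.mean(axis=0)
    n = mats.shape[1]
    return ((mats - mean) ** 2).sum(axis=(1, 2)) / n**2


if __name__ == "__main__":
    # Three toy base clusterings of four samples (made-up labels).
    base = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 0]]
    print(weighted_consensus(base))
    print(deviation_from_mean(base))
```

A consensus partition could then be obtained by running any similarity-based clustering method on the resulting matrix; which method the paper uses for this step is not stated in the abstract.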


Related papers