Large-scale semi-supervised learning with online spectral graph sparsification
Abstract Overview
The paper introduces Sparse-HFS, a scalable graph-based semi-supervised learning algorithm that operates under stringent memory and computation constraints where storing the full similarity graph is infeasible. It combines online spectral graph sparsification (Kelner and Levin, 2013) with SDD matrix solvers to incrementally maintain a spectral sparsifier from an insertion-only edge stream, achieving O(n polylog(n)) space and O(m polylog(n)) time complexity. The paper provides a theoretical generalization bound via algorithmic stability analysis that quantifies how sparsification accuracy ε affects learning performance. Experiments on a synthetic dataset demonstrate that the approximate method achieves accuracy close to exact stable-HFS while using substantially fewer edges.
Novelty
The main novelty is the integration of online spectral graph sparsification into graph-based semi-supervised learning, yielding a semi-streaming variant of stable-HFS that never requires storing the full graph in memory. The work is also distinctive in providing a stability-based generalization analysis that formally quantifies the effect of the sparsification approximation on learning performance, showing the convergence rate is preserved for fixed ε.
Results
Theorem 1 proves that for fixed sparsification accuracy ε, Sparse-HFS retains the same asymptotic order of generalization-error convergence as exact stable-HFS, with an additional approximation term scaling as ε²/l² that is dominated by other bound terms. In experiments on a synthetic dataset with 12,100 points and dense k-NN graphs (ε=0.8, γ=1), at k=4500 where accuracy is near optimal, the sparsifier retains only about 10% of the original graph's edges while achieving similar accuracy to stable-HFS.
Key Points
- Sparse-HFS maintains an online (1±ε) spectral sparsifier so that graph-based SSL can be performed in O(n polylog(n)) space and O(m polylog(n)) time, processing edges from an insertion-only stream without ever storing the full graph.
- Theoretical analysis via algorithmic stability shows that sparsification adds a controlled approximation term (scaling as ε²/l²) to the generalization bound, while preserving the same asymptotic convergence order as exact stable-HFS for fixed ε.
- On a synthetic benchmark with 12,100 points, Sparse-HFS closely tracks stable-HFS performance near the best operating region (k≈4500) while retaining only about 10% of the original graph's edges.