Minimal Variance Sampling with Provable Guarantees for Fast Training of
Graph Neural Networks
- URL: http://arxiv.org/abs/2006.13866v2
- Date: Sun, 5 Sep 2021 16:41:27 GMT
- Title: Minimal Variance Sampling with Provable Guarantees for Fast Training of
Graph Neural Networks
- Authors: Weilin Cong, Rana Forsati, Mahmut Kandemir, Mehrdad Mahdavi
- Abstract summary: Existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization.
We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance.
We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails a better generalization.
- Score: 22.618779809748435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sampling methods (e.g., node-wise, layer-wise, or subgraph) has become an
indispensable strategy to speed up training large-scale Graph Neural Networks
(GNNs). However, existing sampling methods are mostly based on the graph
structural information and ignore the dynamicity of optimization, which leads
to high variance in estimating the stochastic gradients. The high variance
issue can be very pronounced in extremely large graphs, where it results in
slow convergence and poor generalization. In this paper, we theoretically
analyze the variance of sampling methods and show that, due to the composite
structure of empirical risk, the variance of any sampling method can be
decomposed into \textit{embedding approximation variance} in the forward stage
and \textit{stochastic gradient variance} in the backward stage that
necessities mitigating both types of variance to obtain faster convergence
rate. We propose a decoupled variance reduction strategy that employs
(approximate) gradient information to adaptively sample nodes with minimal
variance, and explicitly reduces the variance introduced by embedding
approximation. We show theoretically and empirically that the proposed method,
even with smaller mini-batch sizes, enjoys a faster convergence rate and
entails a better generalization compared to the existing methods.
Related papers
- Stochastic Variance-Reduced Iterative Hard Thresholding in Graph Sparsity Optimization [0.626226809683956]
We introduce two methods to solve gradient-based graph sparsity optimization: GraphRG-IHT and GraphSG-IHT.
We provide a general general for theoretical analysis, demonstrating that methods enjoy a gradient-based framework.
arXiv Detail & Related papers (2024-07-24T03:26:26Z) - A Geometric Perspective on Diffusion Models [57.27857591493788]
We inspect the ODE-based sampling of a popular variance-exploding SDE.
We establish a theoretical relationship between the optimal ODE-based sampling and the classic mean-shift (mode-seeking) algorithm.
arXiv Detail & Related papers (2023-05-31T15:33:16Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite
Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with an average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and
Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z) - On the Importance of Sampling in Learning Graph Convolutional Networks [13.713485304798368]
Graph Convolutional Networks (GCNs) have achieved impressive empirical advancement across a wide variety of graph-related applications.
Despite their success, training GCNs on large graphs suffers from computational and memory issues.
We describe and analyze a general textbftextitdoubly variance reduction schema that can accelerate any sampling method under the memory budget.
arXiv Detail & Related papers (2021-03-03T21:31:23Z) - Effective Proximal Methods for Non-convex Non-smooth Regularized
Learning [27.775096437736973]
We show that the independent sampling scheme tends to improve performance of the commonly-used uniform sampling scheme.
Our new analysis also derives a speed for the sampling than best one available so far.
arXiv Detail & Related papers (2020-09-14T16:41:32Z) - Unified Analysis of Stochastic Gradient Methods for Composite Convex and
Smooth Optimization [15.82816385434718]
We present a unified theorem for the convergence analysis of gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer.
We do this by extending the unified analysis of Gorbunov, Hanzely & Richt'arik ( 2020) and dropping the requirement that the loss function be strongly convex.
Our unified analysis applies to a host of existing algorithms such as proximal SGD, variance reduced methods, quantization and some coordinate descent type methods.
arXiv Detail & Related papers (2020-06-20T13:40:27Z) - Bandit Samplers for Training Graph Neural Networks [63.17765191700203]
Several sampling algorithms with variance reduction have been proposed for accelerating the training of Graph Convolution Networks (GCNs)
These sampling algorithms are not applicable to more general graph neural networks (GNNs) where the message aggregator contains learned weights rather than fixed weights, such as Graph Attention Networks (GAT)
arXiv Detail & Related papers (2020-06-10T12:48:37Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and or binary weights the training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - A Hybrid-Order Distributed SGD Method for Non-Convex Optimization to
Balance Communication Overhead, Computational Complexity, and Convergence
Rate [28.167294398293297]
We propose a method of distributed gradient descent (SGD) with low communication load and computational complexity, and still fast.
To reduce the computational complexity in each iteration, the worker nodes approximate the directional derivatives with zeroth-order gradient estimation.
arXiv Detail & Related papers (2020-03-27T14:02:15Z) - The Simulator: Understanding Adaptive Sampling in the
Moderate-Confidence Regime [52.38455827779212]
We propose a novel technique for analyzing adaptive sampling called the em Simulator.
We prove the first instance-based lower bounds the top-k problem which incorporate the appropriate log-factors.
Our new analysis inspires a simple and near-optimal for the best-arm and top-k identification, the first em practical of its kind for the latter problem.
arXiv Detail & Related papers (2017-02-16T23:42:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.