Abstract: This paper studies the global convergence of gradient descent for deep ReLU
networks under the square loss. In this setting, the current state-of-the-art
results show that gradient descent converges to a global optimum if the width
of every hidden layer scales as $\Omega(N^8)$, where $N$ is the number
of training samples. In this paper, we present a simple proof framework that
allows us to improve the existing over-parameterization condition to linear,
quadratic, or cubic widths, depending on the initialization scheme
and/or the depth of the network.
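
To make the gap explicit, writing $m_\ell$ for the width of hidden layer $\ell$ (notation introduced here for illustration only), the improvement stated above can be summarized as
\[
\text{prior work: } m_\ell = \Omega(N^8) \text{ for all } \ell
\qquad \longrightarrow \qquad
\text{this work: } m_\ell = \Omega(N),\ \Omega(N^2),\ \text{or } \Omega(N^3),
\]
where which of the three rates applies depends on the initialization scheme and the depth of the network, as detailed in the paper.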