Abstract: Several recent works have shown separation results between deep neural
networks and hypothesis classes with inferior approximation capacity, such as
shallow networks or kernel classes. On the other hand, the fact that deep
networks can efficiently express a target function does not mean this target
function can be learned efficiently by deep neural networks. In this work we
study the intricate connection between learnability and approximation capacity.
We show that the learnability of a target function by deep networks depends on
the ability of simpler classes to approximate the target. Specifically, we show
that a necessary condition for a function to be learnable by gradient descent
on deep neural networks is that it can be approximated, at least in a weak
sense, by shallow neural networks. We also show that a class of
functions can be learned by an efficient statistical query algorithm if and
only if it can be approximated in a weak sense by some kernel class. We give
several examples of functions that demonstrate depth separation and conclude
that they cannot be efficiently learned, even by a hypothesis class that can
efficiently approximate them.