Abstract: We introduce MADGRAD, a novel optimization method in the family of AdaGrad
adaptive gradient methods. MADGRAD shows excellent performance on deep learning
optimization problems from multiple fields, including classification and
image-to-image tasks in vision, and recurrent and bidirectionally-masked models
in natural language processing. For each of these tasks, MADGRAD matches or
outperforms both SGD and ADAM in test set performance, even on problems for
which adaptive methods normally perform poorly.
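The abstract does not spell out the update rule, but the published MADGRAD method combines AdaGrad-style weighted gradient accumulation, a dual-averaged iterate with a cube-root denominator, and momentum via iterate averaging. The sketch below is a minimal single-parameter illustration of that update under those assumptions; the step size `gamma`, momentum value, and the toy quadratic objective are illustrative choices, not taken from this paper.

```python
import math

def madgrad_minimize(grad, x0, steps=2000, gamma=0.1, momentum=0.9, eps=1e-6):
    """Illustrative 1-D sketch of the MADGRAD update:
    dual averaging with a cube-root denominator plus momentum."""
    s = 0.0  # weighted running sum of gradients
    v = 0.0  # weighted running sum of squared gradients
    x = x0
    c = 1.0 - momentum  # interpolation weight toward the dual-averaged point
    for k in range(steps):
        lam = gamma * math.sqrt(k + 1)  # AdaGrad-family increasing weight
        g = grad(x)
        s += lam * g
        v += lam * g * g
        z = x0 - s / (v ** (1.0 / 3.0) + eps)  # dual-averaged iterate
        x = (1.0 - c) * x + c * z  # momentum as averaging with the previous iterate
    return x

# Toy convex objective f(x) = (x - 3)^2 with gradient 2(x - 3)
x_star = madgrad_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

On this toy quadratic the iterate settles near the minimizer at 3; a practical deep-learning run would instead use the vectorized per-coordinate form of the same update.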