Straightening Out the Straight-Through Estimator: Overcoming
Optimization Challenges in Vector Quantized Networks
- URL: http://arxiv.org/abs/2305.08842v1
- Date: Mon, 15 May 2023 17:56:36 GMT
- Title: Straightening Out the Straight-Through Estimator: Overcoming
Optimization Challenges in Vector Quantized Networks
- Authors: Minyoung Huh, Brian Cheung, Pulkit Agrawal, Phillip Isola
- Abstract summary: This work examines the challenges of training neural networks that use vector quantization with straight-through estimation.
We find that a primary cause of training instability is the discrepancy between the model embedding and the code-vector distribution.
We identify the factors that contribute to this issue, including the codebook gradient sparsity and the asymmetric nature of the commitment loss.
- Score: 35.6604960300194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work examines the challenges of training neural networks that
use vector quantization with straight-through estimation. We find that a primary cause of
training instability is the discrepancy between the model embedding and the
code-vector distribution. We identify the factors that contribute to this
issue, including the codebook gradient sparsity and the asymmetric nature of
the commitment loss, which leads to misaligned code-vector assignments. We
propose to address this issue via affine re-parameterization of the code
vectors. Additionally, we introduce an alternating optimization to reduce the
gradient error introduced by the straight-through estimation. Moreover, we
propose an improvement to the commitment loss to ensure better alignment
between the codebook representation and the model embedding. These optimization
methods improve the mathematical approximation of the straight-through
estimation and, ultimately, the model performance. We demonstrate the
effectiveness of our methods on several common model architectures, such as
AlexNet, ResNet, and ViT, across various tasks, including image classification
and generative modeling.
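To make the setup concrete, below is a minimal PyTorch sketch of a vector-quantization bottleneck trained with the straight-through estimator, with a shared affine re-parameterization of the code vectors added in the spirit of the abstract above. The class name AffineVectorQuantizer, the shared scale/shift form of the re-parameterization, and the beta weight are illustrative assumptions rather than the authors' released implementation; the nearest-code lookup, codebook/commitment losses, and straight-through gradient copy follow the standard VQ-VAE formulation.

```python
# Minimal sketch (PyTorch) of a VQ layer with a straight-through estimator
# and an assumed affine re-parameterization of the codebook. Illustrative
# only; not the authors' reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AffineVectorQuantizer(nn.Module):
    def __init__(self, num_codes: int, dim: int, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        # Assumed affine re-parameterization: a scale and shift shared by all
        # code vectors, so the codebook as a whole can track the encoder's
        # embedding distribution even when individual codes get sparse gradients.
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))
        self.beta = beta  # weight of the commitment term

    def forward(self, z_e: torch.Tensor):
        # z_e: encoder output of shape (batch, dim)
        codes = self.codebook * self.scale + self.shift   # (K, dim)
        dists = torch.cdist(z_e, codes)                   # (batch, K) L2 distances
        idx = dists.argmin(dim=1)                         # index of nearest code
        z_q = codes[idx]                                  # (batch, dim)

        # Standard VQ-VAE objective: the codebook loss pulls codes toward the
        # (detached) encoder outputs; the commitment loss pulls the encoder
        # toward the (detached) selected codes.
        codebook_loss = F.mse_loss(z_q, z_e.detach())
        commitment_loss = F.mse_loss(z_e, z_q.detach())
        vq_loss = codebook_loss + self.beta * commitment_loss

        # Straight-through estimator: the forward pass uses the quantized
        # values, while the backward pass copies gradients from z_q to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, vq_loss
```

Because the scale and shift are shared across the whole codebook, they receive gradients from every batch element regardless of which code was selected, so the affine parameters stay densely updated even when individual codebook rows suffer from the gradient sparsity described in the abstract; this is only the intuition behind re-parameterizing the codebook, under the assumptions stated above.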
Related papers
- Gaussian Mixture Vector Quantization with Aggregated Categorical Posterior [5.862123282894087]
We build on the Vector Quantized Variational Autoencoder (VQ-VAE), a type of variational autoencoder that uses discrete embeddings as latents.
We show that GM-VQ improves codebook utilization and reduces information loss without relying on handcrafted heuristics.
arXiv Detail & Related papers (2024-10-14T05:58:11Z) - Adaptive operator learning for infinite-dimensional Bayesian inverse problems [7.716833952167609]
We develop an adaptive operator learning framework that can reduce modeling error gradually by forcing the surrogate to be accurate in local areas.
We present a rigorous convergence guarantee in the linear case using the UKI framework.
The numerical results show that our method can significantly reduce computational costs while maintaining inversion accuracy.
arXiv Detail & Related papers (2023-10-27T01:50:33Z) - Deep Graph Reprogramming [112.34663053130073]
"Deep graph reprogramming" is a model reusing task tailored for graph neural networks (GNNs)
We propose an innovative Data Reprogramming paradigm alongside a Model Reprogramming paradigm.
arXiv Detail & Related papers (2023-04-28T02:04:29Z) - Bayesian Graph Contrastive Learning [55.36652660268726]
We propose a novel perspective on graph contrastive learning methods, showing that random augmentations lead to stochastic encoders.
Our proposed method represents each node by a distribution in the latent space in contrast to existing techniques which embed each node to a deterministic vector.
We show a considerable improvement in performance compared to existing state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2021-12-15T01:45:32Z) - Adaptive Projected Residual Networks for Learning Parametric Maps from
Sparse Data [5.920947681019466]
We present a parsimonious surrogate framework for learning high dimensional parametric maps from limited training data.
These applications include such "outer-loop" problems as Bayesian inverse problems, optimal experimental design, and optimal design and control under uncertainty.
arXiv Detail & Related papers (2021-12-14T01:29:19Z) - Joint inference and input optimization in equilibrium networks [68.63726855991052]
A deep equilibrium model is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z) - Modeling Design and Control Problems Involving Neural Network Surrogates [1.1602089225841632]
We consider nonlinear optimization problems that involve surrogate models represented by neural networks.
We show how to directly embed neural network evaluation into optimization models and highlight a difficulty with this approach that can prevent convergence.
We present two alternative formulations of these problems in the specific case of feedforward neural networks with ReLU activation.
arXiv Detail & Related papers (2021-11-20T01:09:15Z) - Cogradient Descent for Dependable Learning [64.02052988844301]
We propose a dependable learning framework based on the Cogradient Descent (CoGD) algorithm to address the bilinear optimization problem.
CoGD is introduced to solve bilinear problems when one variable is subject to a sparsity constraint.
It can also be used to decompose the association of features and weights, which further generalizes our method to better train convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-06-20T04:28:20Z) - Adaptive Importance Sampling for Finite-Sum Optimization and Sampling
with Decreasing Step-Sizes [4.355567556995855]
We propose Avare, a simple and efficient algorithm for adaptive importance sampling for finite-sum optimization and sampling with decreasing step-sizes.
Under standard technical conditions, we show that Avare achieves $\mathcal{O}(T^{2/3})$ and $\mathcal{O}(T^{5/6})$ dynamic regret for SGD and SGLD respectively when run with $\mathcal{O}(T^{5/6})$ step sizes.
arXiv Detail & Related papers (2021-03-23T00:28:15Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve for one variable by considering its coupling relationship with the other, leading to synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under a sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)