Iterative Refinement in the Continuous Space for Non-Autoregressive
Neural Machine Translation
- URL: http://arxiv.org/abs/2009.07177v1
- Date: Tue, 15 Sep 2020 15:30:14 GMT
- Title: Iterative Refinement in the Continuous Space for Non-Autoregressive
Neural Machine Translation
- Authors: Jason Lee, Raphael Shu, Kyunghyun Cho
- Abstract summary: We propose an efficient inference procedure for non-autoregressive machine translation.
It iteratively refines translation purely in the continuous space.
We evaluate our approach on WMT'14 En-De, WMT'16 Ro-En and IWSLT'16 De-En.
- Score: 68.25872110275542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an efficient inference procedure for non-autoregressive machine
translation that iteratively refines translation purely in the continuous
space. Given a continuous latent variable model for machine translation (Shu et
al., 2020), we train an inference network to approximate the gradient of the
marginal log probability of the target sentence, using only the latent variable
as input. This allows us to use gradient-based optimization to find the target
sentence at inference time that approximately maximizes its marginal
probability. As each refinement step only involves computation in the latent
space of low dimensionality (we use 8 in our experiments), we avoid
computational overhead incurred by existing non-autoregressive inference
procedures that often refine in token space. We compare our approach to a
recently proposed EM-like inference procedure (Shu et al., 2020) that optimizes
in a hybrid space, consisting of both discrete and continuous variables. We
evaluate our approach on WMT'14 En-De, WMT'16 Ro-En and IWSLT'16 De-En, and
observe two advantages over the EM-like inference: (1) it is computationally
efficient, i.e. each refinement step is twice as fast, and (2) it is more
effective, resulting in higher marginal probabilities and BLEU scores with the
same number of refinement steps. On WMT'14 En-De, for instance, our approach is
able to decode 6.2 times faster than the autoregressive model with minimal
degradation to translation quality (0.9 BLEU).
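As a rough illustration of the inference procedure described above, the following is a minimal sketch of gradient-based iterative refinement in a low-dimensional continuous latent space. It is not the authors' released code: the module names (encode, init_latent, score_network, decode), the step size, and the number of steps are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' implementation) of iterative
# refinement purely in the continuous latent space of a latent-variable NMT model.
import torch

@torch.no_grad()
def refine_and_decode(model, src, steps=4, step_size=1.0):
    enc = model.encode(src)
    z = model.init_latent(enc)              # initial latent, e.g. the prior mean; shape [batch, length, 8]
    for _ in range(steps):
        # The learned inference network predicts a refinement direction from the
        # latent alone, standing in for the intractable gradient of the marginal
        # log-probability of the target sentence.
        delta = model.score_network(z, enc)
        z = z + step_size * delta            # gradient-ascent-style update in latent space
    return model.decode(z, enc)              # a single parallel decoding pass at the end
```

Because each update touches only the low-dimensional latent tensor (dimensionality 8 per position in the paper's experiments), the decoder is run once at the end rather than at every refinement step, which is where the speed advantage over token-space refinement comes from.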
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth distribution and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z)
- Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods [75.34939761152587]
Efficient computation of the optimal transport distance between two distributions serves as an algorithmic subroutine that empowers various applications.
This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy.
arXiv Detail & Related papers (2023-01-30T15:46:39Z)
- Convergence of the mini-batch SIHT algorithm [0.0]
The Iterative Hard Thresholding (IHT) algorithm has been considered extensively as an effective deterministic algorithm for solving sparse optimization problems.
We show that the sequence generated by the sparse mini-batch SIHT is a supermartingale and converges with probability one (a minimal sketch of the SIHT update appears after this list).
arXiv Detail & Related papers (2022-09-29T03:47:46Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional, previously unexplored benefit: it can simultaneously reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)
- A Variance Controlled Stochastic Method with Biased Estimation for Faster Non-convex Optimization [0.0]
We propose a new technique, variance controlled stochastic gradient (VCSG), to improve the performance of the stochastic variance reduced gradient (SVRG) method.
A parameter $\lambda$ is introduced in VCSG to avoid over-reducing the variance by SVRG.
The method requires $\mathcal{O}(\min\{1/\epsilon^{3/2}, n^{1/4}/\epsilon\})$ gradient evaluations, which improves the leading gradient complexity.
arXiv Detail & Related papers (2021-02-19T12:22:56Z)
- On Stochastic Variance Reduced Gradient Method for Semidefinite Optimization [14.519696724619074]
The SVRG method has been regarded as one of the most effective methods.
There is, however, a significant gap between theory and practice when the method is adapted to semidefinite programming (SDP).
In this paper, we fill this gap by exploiting a new variant of the original SVRG with Option I adapted to semidefinite optimization.
arXiv Detail & Related papers (2021-01-01T13:55:32Z)
- Unbiased Gradient Estimation for Variational Auto-Encoders using Coupled Markov Chains [34.77971292478243]
The variational auto-encoder (VAE) is a deep latent variable model that has two neural networks in an autoencoder-like architecture.
We develop a training scheme for VAEs by introducing unbiased estimators of the log-likelihood gradient.
We show experimentally that VAEs fitted with unbiased estimators exhibit better predictive performance.
arXiv Detail & Related papers (2020-10-05T08:11:55Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point per step.
Our results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
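For reference, below is a minimal sketch of the mini-batch SIHT update discussed in "Convergence of the mini-batch SIHT algorithm" above: a stochastic gradient step followed by hard thresholding onto the s-sparse set. The function names, step size, and iteration count are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumptions) of mini-batch stochastic Iterative Hard Thresholding (SIHT).
import numpy as np

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-s:]
    out[keep] = x[keep]
    return out

def minibatch_siht(stochastic_grad, x0, s, step_size=0.1, steps=200):
    """stochastic_grad(x) should return a gradient estimate computed on a fresh mini-batch."""
    x = hard_threshold(np.asarray(x0, dtype=float), s)
    for _ in range(steps):
        # Stochastic gradient step followed by projection onto the s-sparse set.
        x = hard_threshold(x - step_size * stochastic_grad(x), s)
    return x
```

The convergence result summarized above concerns the sequence of iterates produced by exactly this kind of update.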