Nearest Neighbor Knowledge Distillation for Neural Machine Translation
- URL: http://arxiv.org/abs/2205.00479v1
- Date: Sun, 1 May 2022 14:30:49 GMT
- Title: Nearest Neighbor Knowledge Distillation for Neural Machine Translation
- Authors: Zhixian Yang, Renliang Sun, Xiaojun Wan
- Abstract summary: k-nearest-neighbor machine translation (kNN-MT) has achieved many state-of-the-art results in machine translation tasks.
kNN-KD trains the base NMT model to directly learn the knowledge of kNN retrieval.
- Score: 50.0624778757462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: k-nearest-neighbor machine translation (kNN-MT), proposed by Khandelwal et al.
(2021), has achieved many state-of-the-art results in machine translation tasks.
Although effective, kNN-MT requires conducting kNN searches through a large datastore
at each decoding step during inference, which prohibitively increases the decoding cost
and makes deployment in real-world applications difficult. In this paper, we propose to
move the time-consuming kNN search forward to the preprocessing phase, and then introduce
Nearest Neighbor Knowledge Distillation (kNN-KD), which trains the base NMT model to
directly learn the knowledge of kNN retrieval. Distilling knowledge retrieved by kNN
encourages the NMT model to take more reasonable target tokens into consideration, thus
addressing the overcorrection problem. Extensive experimental results show that the
proposed method achieves consistent improvements over state-of-the-art baselines
including kNN-MT, while maintaining the same training and decoding speed as the standard
NMT model.
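Since the speed-up hinges on precomputing the kNN retrieval before training, a minimal PyTorch sketch of the general recipe may help: build a soft distribution from the nearest neighbors once, offline, and then distill it into the NMT model during training. The temperature, the mixing weight `alpha`, and the exact form of the distillation term below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def knn_distribution(query, keys, values, vocab_size, k=8, temperature=10.0):
    """Turn the k nearest datastore entries into a soft distribution over the vocabulary.

    query:  (d,)   decoder hidden state at one target position.
    keys:   (N, d) float tensor of decoder hidden states collected from the training data.
    values: (N,)   long tensor of the target-token ids paired with each key.
    """
    dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)   # (N,) L2 distances
    top_d, top_i = dists.topk(k, largest=False)                # k closest entries
    weights = F.softmax(-top_d / temperature, dim=-1)          # closer neighbors weigh more
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, values[top_i], weights)              # aggregate weights per token id
    return p_knn                                               # (vocab_size,) probabilities

def knn_kd_loss(log_probs, gold, p_knn, alpha=0.5):
    """Illustrative distillation objective: mix the usual gold-token cross-entropy with a
    term pulling the NMT distribution toward the precomputed kNN distributions.

    log_probs: (B, V) log-probabilities from the NMT model.
    gold:      (B,)   gold target-token ids.
    p_knn:     (B, V) per-position kNN distributions, stacked from offline preprocessing.
    """
    nll = F.nll_loss(log_probs, gold)                          # standard NMT loss
    kd = F.kl_div(log_probs, p_knn, reduction="batchmean")     # distillation from kNN
    return (1.0 - alpha) * nll + alpha * kd
```

In this sketch, `knn_distribution` would be run once per target position of the training data during preprocessing, so decoding later needs no datastore lookup at all, which is the point the abstract stresses.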
Related papers
- Code-Switching with Word Senses for Pretraining in Neural Machine
Translation [107.23743153715799]
We introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT).
WSP-NMT is an end-to-end approach for pretraining multilingual NMT models leveraging word sense-specific information from Knowledge Bases.
Our experiments show significant improvements in overall translation quality.
arXiv Detail & Related papers (2023-10-21T16:13:01Z) - A DPLL(T) Framework for Verifying Deep Neural Networks [9.422860826278788]
Like human-written software, Deep Neural Networks (DNNs) can have bugs and can be attacked.
We introduce NeuralSAT, a new verification approach that adapts the DPLL(T) algorithm widely used in modern SMT solvers.
arXiv Detail & Related papers (2023-07-17T18:49:46Z) - Towards Robust k-Nearest-Neighbor Machine Translation [72.9252395037097]
k-Nearest-Neighbor Machine Translation (kNN-MT) has become an important research direction in NMT in recent years.
Its main idea is to retrieve useful key-value pairs from an additional datastore to modify translations without updating the NMT model.
However, noisy retrieved pairs can dramatically degrade model performance.
We propose a confidence-enhanced kNN-MT model with robust training to alleviate the impact of noise.
arXiv Detail & Related papers (2022-10-17T07:43:39Z) - Adaptive Nearest Neighbor Machine Translation [60.97183408140499]
kNN-MT combines a pre-trained neural machine translation model with token-level k-nearest-neighbor retrieval.
The traditional kNN algorithm retrieves the same number of nearest neighbors for each target token.
We propose Adaptive kNN-MT to dynamically determine the number of neighbors k for each target token (a rough sketch of the underlying retrieval-and-interpolation mechanism follows after this list).
arXiv Detail & Related papers (2021-05-27T09:27:42Z) - Progressive Tandem Learning for Pattern Recognition with Deep Spiking
Neural Networks [80.15411508088522]
Spiking neural networks (SNNs) have shown advantages over traditional artificial neural networks (ANNs) in low latency and high computational efficiency.
We propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition.
arXiv Detail & Related papers (2020-07-02T15:38:44Z) - Understanding Learning Dynamics for Neural Machine Translation [53.23463279153577]
We propose to understand the learning dynamics of NMT by using Loss Change Allocation (LCA) (Lan et al., 2019).
As LCA requires calculating the gradient on an entire dataset for each update, we instead present an approximation to make it practical in the NMT scenario.
Our simulated experiments show that this approximate calculation is efficient and empirically delivers consistent results (a rough LCA sketch follows after this list).
arXiv Detail & Related papers (2020-04-05T13:32:58Z)
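The Robust and Adaptive kNN-MT entries above build on the same token-level interpolation between the base NMT distribution and a distribution read off the datastore; the adaptive variant additionally learns, per token, how many neighbors to trust. A rough PyTorch sketch of that shared mechanism follows; the interpolation weight `lam`, the temperature, and the tiny meta-network are assumptions, not the respective authors' implementations.

```python
import torch
import torch.nn.functional as F

def knn_mt_step(p_nmt, query, keys, values, vocab_size, k=8, lam=0.5, temperature=10.0):
    """One decoding step of vanilla kNN-MT style interpolation (illustrative).

    p_nmt: (vocab_size,) next-token distribution from the base NMT model.
    keys / values: datastore of (decoder hidden state, target-token id) pairs.
    """
    dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)   # distances to all entries
    top_d, top_i = dists.topk(k, largest=False)                # k nearest neighbors
    weights = F.softmax(-top_d / temperature, dim=-1)
    p_knn = torch.zeros(vocab_size).scatter_add_(0, values[top_i], weights)
    return lam * p_knn + (1.0 - lam) * p_nmt                   # interpolated distribution

class AdaptiveKSelector(torch.nn.Module):
    """Toy meta-network in the spirit of Adaptive kNN-MT: from the retrieval distances it
    predicts a weight for each candidate k, including k = 0 (ignore the datastore)."""
    def __init__(self, max_k=8, hidden=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(max_k, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, max_k + 1),
        )

    def forward(self, top_d):                                  # top_d: (max_k,) distances
        return F.softmax(self.net(top_d), dim=-1)              # weights over k = 0..max_k
```

In practice the datastore holds millions of entries and the search is done with an approximate index such as FAISS rather than the exact `torch.cdist` used here for clarity.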
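For the learning-dynamics entry, LCA itself is a first-order decomposition: the change in training loss across one optimizer step is attributed to each parameter through the product of the loss gradient and that parameter's update. A minimal sketch under stated assumptions follows; the batch-level gradient estimate and the `loss_fn` helper are hypothetical stand-ins, since the abstract does not spell out the approximation.

```python
import torch

def lca_allocation(model, loss_fn, batch, prev_params, curr_params):
    """First-order Loss Change Allocation sketch.

    prev_params / curr_params: lists of tensors, snapshots of model.parameters() taken
    before and after one optimizer step, in the same order as model.parameters().
    loss_fn(model, batch) returns a scalar loss; using a sampled batch in place of the
    full training set is the assumed approximation (details are not from the abstract).
    Note: this leaves the model parameters at the earlier snapshot.
    """
    # Evaluate the gradient at the earlier snapshot.
    with torch.no_grad():
        for p, prev in zip(model.parameters(), prev_params):
            p.copy_(prev)
    grads = torch.autograd.grad(loss_fn(model, batch), list(model.parameters()))
    # Per-parameter allocation: grad_i * (theta_new_i - theta_old_i).
    alloc = [g * (cur - prev) for g, prev, cur in zip(grads, prev_params, curr_params)]
    # Summing every entry approximates the total change in loss over the step.
    return alloc
```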