Nearest Neighbor Knowledge Distillation for Neural Machine Translation
- URL: http://arxiv.org/abs/2205.00479v1
- Date: Sun, 1 May 2022 14:30:49 GMT
- Title: Nearest Neighbor Knowledge Distillation for Neural Machine Translation
- Authors: Zhixian Yang, Renliang Sun, Xiaojun Wan
- Abstract summary: k-nearest-neighbor machine translation (kNN-MT) has achieved many state-of-the-art results in machine translation tasks.
kNN-KD trains the base NMT model to directly learn the knowledge of kNN retrieval.
- Score: 50.0624778757462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: k-nearest-neighbor machine translation (kNN-MT), proposed by Khandelwal et al.
(2021), has achieved many state-of-the-art results in machine translation tasks.
Although effective, kNN-MT requires conducting kNN searches through a large datastore
at each decoding step during inference, which prohibitively increases the decoding cost
and makes deployment in real-world applications difficult. In this paper, we propose to
move the time-consuming kNN search forward to the preprocessing phase, and then introduce
Nearest Neighbor Knowledge Distillation (kNN-KD), which trains the base NMT model to
directly learn the knowledge of kNN retrieval. Distilling knowledge retrieved by kNN
encourages the NMT model to take more reasonable target tokens into consideration, thus
addressing the overcorrection problem. Extensive experimental results show that the
proposed method achieves consistent improvements over state-of-the-art baselines
including kNN-MT, while maintaining the same training and decoding speed as the standard
NMT model.
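Since the speed-up hinges on precomputing the kNN retrieval before training, a minimal PyTorch sketch of the general recipe may help: build a soft distribution from the nearest neighbors once, offline, and then distill it into the NMT model during training. The temperature, the mixing weight `alpha`, and the exact form of the distillation term below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def knn_distribution(query, keys, values, vocab_size, k=8, temperature=10.0):
    """Turn the k nearest datastore entries into a soft distribution over the vocabulary.

    query:  (d,)   decoder hidden state at one target position.
    keys:   (N, d) float tensor of decoder hidden states collected from the training data.
    values: (N,)   long tensor of the target-token ids paired with each key.
    """
    dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)   # (N,) L2 distances
    top_d, top_i = dists.topk(k, largest=False)                # k closest entries
    weights = F.softmax(-top_d / temperature, dim=-1)          # closer neighbors weigh more
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, values[top_i], weights)              # aggregate weights per token id
    return p_knn                                               # (vocab_size,) probabilities

def knn_kd_loss(log_probs, gold, p_knn, alpha=0.5):
    """Illustrative distillation objective: mix the usual gold-token cross-entropy with a
    term pulling the NMT distribution toward the precomputed kNN distributions.

    log_probs: (B, V) log-probabilities from the NMT model.
    gold:      (B,)   gold target-token ids.
    p_knn:     (B, V) per-position kNN distributions, stacked from offline preprocessing.
    """
    nll = F.nll_loss(log_probs, gold)                          # standard NMT loss
    kd = F.kl_div(log_probs, p_knn, reduction="batchmean")     # distillation from kNN
    return (1.0 - alpha) * nll + alpha * kd
```

In this sketch, `knn_distribution` would be run once per target position of the training data during preprocessing, so decoding later needs no datastore lookup at all, which is the point the abstract stresses.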
Related papers
- Code-Switching with Word Senses for Pretraining in Neural Machine
Translation [107.23743153715799]
We introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT).
WSP-NMT is an end-to-end approach for pretraining multilingual NMT models leveraging word sense-specific information from Knowledge Bases.
Our experiments show significant improvements in overall translation quality.
arXiv Detail & Related papers (2023-10-21T16:13:01Z) - A DPLL(T) Framework for Verifying Deep Neural Networks [9.422860826278788]
Like human-written software, Deep Neural Networks (DNNs) can have bugs and can be attacked.
We introduce NeuralSAT, a new verification approach that adapts the DPLL(T) algorithm widely used in modern SMT solvers.
arXiv Detail & Related papers (2023-07-17T18:49:46Z) - Towards Robust k-Nearest-Neighbor Machine Translation [72.9252395037097]
k-Nearest-Neighbor Machine Translation (kNN-MT) has become an important research direction in NMT in recent years.
Its main idea is to retrieve useful key-value pairs from an additional datastore to modify translations without updating the NMT model.
However, noisy retrieved pairs can dramatically degrade model performance.
We propose a confidence-enhanced kNN-MT model with robust training to alleviate the impact of noise.
arXiv Detail & Related papers (2022-10-17T07:43:39Z) - Adaptive Nearest Neighbor Machine Translation [60.97183408140499]
kNN-MT combines a pre-trained neural machine translation model with token-level k-nearest-neighbor retrieval.
The traditional kNN algorithm retrieves the same number of nearest neighbors for each target token.
We propose Adaptive kNN-MT to dynamically determine the number of neighbors k for each target token (a rough sketch of the underlying retrieval-and-interpolation mechanism follows after this list).
arXiv Detail & Related papers (2021-05-27T09:27:42Z) - Progressive Tandem Learning for Pattern Recognition with Deep Spiking
Neural Networks [80.15411508088522]
Spiking neural networks (SNNs) have shown advantages over traditional artificial neural networks (ANNs) in low latency and high computational efficiency.
We propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition.
arXiv Detail & Related papers (2020-07-02T15:38:44Z) - Understanding Learning Dynamics for Neural Machine Translation [53.23463279153577]
We propose to understand the learning dynamics of NMT by using Loss Change Allocation (LCA) (Lan et al., 2019).
As LCA requires calculating the gradient on an entire dataset for each update, we instead present an approximation to make it practical in the NMT scenario.
Our simulated experiments show that this approximate calculation is efficient and empirically delivers consistent results (a rough LCA sketch follows after this list).
arXiv Detail & Related papers (2020-04-05T13:32:58Z)
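The Robust and Adaptive kNN-MT entries above build on the same token-level interpolation between the base NMT distribution and a distribution read off the datastore; the adaptive variant additionally learns, per token, how many neighbors to trust. A rough PyTorch sketch of that shared mechanism follows; the interpolation weight `lam`, the temperature, and the tiny meta-network are assumptions, not the respective authors' implementations.

```python
import torch
import torch.nn.functional as F

def knn_mt_step(p_nmt, query, keys, values, vocab_size, k=8, lam=0.5, temperature=10.0):
    """One decoding step of vanilla kNN-MT style interpolation (illustrative).

    p_nmt: (vocab_size,) next-token distribution from the base NMT model.
    keys / values: datastore of (decoder hidden state, target-token id) pairs.
    """
    dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)   # distances to all entries
    top_d, top_i = dists.topk(k, largest=False)                # k nearest neighbors
    weights = F.softmax(-top_d / temperature, dim=-1)
    p_knn = torch.zeros(vocab_size).scatter_add_(0, values[top_i], weights)
    return lam * p_knn + (1.0 - lam) * p_nmt                   # interpolated distribution

class AdaptiveKSelector(torch.nn.Module):
    """Toy meta-network in the spirit of Adaptive kNN-MT: from the retrieval distances it
    predicts a weight for each candidate k, including k = 0 (ignore the datastore)."""
    def __init__(self, max_k=8, hidden=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(max_k, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, max_k + 1),
        )

    def forward(self, top_d):                                  # top_d: (max_k,) distances
        return F.softmax(self.net(top_d), dim=-1)              # weights over k = 0..max_k
```

In practice the datastore holds millions of entries and the search is done with an approximate index such as FAISS rather than the exact `torch.cdist` used here for clarity.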
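For the learning-dynamics entry, LCA itself is a first-order decomposition: the change in training loss across one optimizer step is attributed to each parameter through the product of the loss gradient and that parameter's update. A minimal sketch under stated assumptions follows; the batch-level gradient estimate and the `loss_fn` helper are hypothetical stand-ins, since the abstract does not spell out the approximation.

```python
import torch

def lca_allocation(model, loss_fn, batch, prev_params, curr_params):
    """First-order Loss Change Allocation sketch.

    prev_params / curr_params: lists of tensors, snapshots of model.parameters() taken
    before and after one optimizer step, in the same order as model.parameters().
    loss_fn(model, batch) returns a scalar loss; using a sampled batch in place of the
    full training set is the assumed approximation (details are not from the abstract).
    Note: this leaves the model parameters at the earlier snapshot.
    """
    # Evaluate the gradient at the earlier snapshot.
    with torch.no_grad():
        for p, prev in zip(model.parameters(), prev_params):
            p.copy_(prev)
    grads = torch.autograd.grad(loss_fn(model, batch), list(model.parameters()))
    # Per-parameter allocation: grad_i * (theta_new_i - theta_old_i).
    alloc = [g * (cur - prev) for g, prev, cur in zip(grads, prev_params, curr_params)]
    # Summing every entry approximates the total change in loss over the step.
    return alloc
```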