Leveraging Diverse Modeling Contexts with Collaborating Learning for Neural Machine Translation
- URL: http://arxiv.org/abs/2402.18428v1
- Date: Wed, 28 Feb 2024 15:55:02 GMT
- Title: Leveraging Diverse Modeling Contexts with Collaborating Learning for Neural Machine Translation
- Authors: Yusheng Liao and Yanfeng Wang and Yu Wang
- Abstract summary: Autoregressive (AR) and Non-autoregressive (NAR) models are two types of generative models for Neural Machine Translation (NMT).
We propose a novel generic collaborative learning method, DCMCL, where AR and NAR models are treated as collaborators instead of teachers and students.
- Score: 26.823126615724888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive (AR) and Non-autoregressive (NAR) models are two types of
generative models for Neural Machine Translation (NMT). AR models predict
tokens in a word-by-word manner and can effectively capture the distribution of
real translations. NAR models predict tokens in parallel by extracting bidirectional
contextual information, which improves inference speed but leads to performance
degradation. Previous works used AR models to enhance NAR models by reducing the
complexity of the training data, or used NAR models to incorporate global
information into AR models. However, these methods exploit the contextual
information of only a single type of model and neglect the diverse contextual
information that different types of models can provide. In this paper, we propose a novel
generic collaborative learning method, DCMCL, where AR and NAR models are
treated as collaborators instead of teachers and students. To hierarchically
leverage the bilateral contextual information, token-level mutual learning and
sequence-level contrastive learning are adopted between AR and NAR models.
Extensive experiments on four widely used benchmarks show that the proposed
DCMCL method can simultaneously improve both AR and NAR models by up to 1.38
and 2.98 BLEU points respectively, and can also outperform the current
best unified model by up to 0.97 BLEU points for both AR and NAR decoding.
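As a reading aid, the two training signals named in the abstract can be pictured as follows. This is a minimal sketch under assumed shapes and hyperparameters: token-level mutual learning rendered as a symmetric KL between the AR and NAR per-token distributions, and sequence-level contrastive learning rendered as an InfoNCE loss over mean-pooled sentence representations. It is not the paper's actual DCMCL formulation, weighting, or temperature.

```python
# Minimal sketch of the two signals named in the abstract (assumed shapes and
# hyperparameters): a symmetric-KL token-level mutual learning loss and an
# InfoNCE-style sequence-level contrastive loss between AR and NAR models.
import torch
import torch.nn.functional as F

def token_mutual_learning_loss(ar_logits, nar_logits):
    """Symmetric KL between per-token distributions; logits: (batch, seq, vocab)."""
    ar_logp = F.log_softmax(ar_logits, dim=-1)
    nar_logp = F.log_softmax(nar_logits, dim=-1)
    kl_ar_to_nar = F.kl_div(nar_logp, ar_logp, log_target=True, reduction="batchmean")
    kl_nar_to_ar = F.kl_div(ar_logp, nar_logp, log_target=True, reduction="batchmean")
    return 0.5 * (kl_ar_to_nar + kl_nar_to_ar)

def sequence_contrastive_loss(ar_states, nar_states, temperature=0.1):
    """InfoNCE: the AR and NAR encodings of the same sentence are positives,
    other sentences in the batch are negatives; states: (batch, seq, hidden)."""
    a = F.normalize(ar_states.mean(dim=1), dim=-1)   # (batch, hidden)
    b = F.normalize(nar_states.mean(dim=1), dim=-1)
    logits = a @ b.t() / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))                # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors in place of real model outputs.
B, T, V, H = 4, 7, 100, 16
loss = token_mutual_learning_loss(torch.randn(B, T, V), torch.randn(B, T, V)) \
       + sequence_contrastive_loss(torch.randn(B, T, H), torch.randn(B, T, H))
print(float(loss))
```

In a full training setup, terms like these would be added, with tuned weights, to each model's usual cross-entropy objective, so that the AR and NAR models learn from each other rather than in a fixed teacher-student direction.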
Related papers
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models [105.70889434492143]
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling.
We show that we can convert AR models ranging from 127M to 7B parameters into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training.
Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts.
arXiv Detail & Related papers (2024-10-23T14:04:22Z)
- Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation [15.632419297059993]
Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT).
A performance gap exists between NAR and autoregressive models due to the large decoding space and the difficulty of accurately capturing dependencies between target words.
We apply reinforcement learning (RL) to the Levenshtein Transformer, a representative edit-based NAR model, demonstrating that RL with self-generated data can enhance the performance of edit-based NAR models (a generic policy-gradient sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-05-02T13:39:28Z)
- Revisiting N-Gram Models: Their Impact in Modern Neural Networks for Handwritten Text Recognition [4.059708117119894]
This study addresses whether explicit language models, specifically n-gram models, still contribute to the performance of state-of-the-art deep learning architectures in the field of handwriting recognition.
We evaluate two prominent neural network architectures, PyLaia and DAN, with and without the integration of explicit n-gram language models.
The results show that incorporating character or subword n-gram models significantly improves the performance of ATR (automatic text recognition) models on all datasets (a minimal shallow-fusion sketch of this kind of integration appears after this list).
arXiv Detail & Related papers (2024-04-30T07:37:48Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amount of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Non-Autoregressive Machine Translation: It's Not as Fast as it Seems [84.47091735503979]
We point out flaws in the evaluation methodology present in the literature on NAR models.
We compare NAR models with other widely used methods for improving efficiency.
We call for more realistic and extensive evaluation of NAR models in future work.
arXiv Detail & Related papers (2022-05-04T09:30:17Z)
- Diformer: Directional Transformer for Neural Machine Translation [13.867255817435705]
Autoregressive (AR) and Non-autoregressive (NAR) models have their respective advantages in performance and latency.
We propose the Directional Transformer (Diformer) by jointly modelling AR and NAR into three generation directions.
Experiments on 4 WMT benchmarks demonstrate that Diformer outperforms current unified-modelling works by more than 1.5 BLEU points for both AR and NAR decoding (a toy sketch of the three direction masks appears after this list).
arXiv Detail & Related papers (2021-12-22T02:35:29Z)
- A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation [59.64193903397301]
Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive baselines.
We conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR).
The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances.
arXiv Detail & Related papers (2021-10-11T13:05:06Z)
- TSNAT: Two-Step Non-Autoregressvie Transformer Models for Speech Recognition [69.68154370877615]
The non-autoregressive (NAR) models can remove the temporal dependency between output tokens and predict all output tokens within one or a few decoding steps.
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
- A Study of Non-autoregressive Model for Sequence Generation [147.89525760170923]
Non-autoregressive (NAR) models generate all the tokens of a sequence in parallel.
We propose knowledge distillation and source-target alignment to bridge the gap between AR and NAR models (a toy sketch of sequence-level knowledge distillation appears after this list).
arXiv Detail & Related papers (2020-04-22T09:16:09Z)
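For the edit-based NAR entry above (Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation), the summary only names the idea of RL with self-generated data. Below is a minimal, generic policy-gradient (REINFORCE) sketch of that idea; the linear "decoder", the token-overlap reward, and all shapes are illustrative assumptions, and the paper's Levenshtein Transformer setup is not implemented.

```python
# Generic REINFORCE sketch of "RL with self-generated data": sample an output
# from the model, score it with a sentence-level reward, and push up the
# log-probability of the sample in proportion to the reward.
import torch

vocab, seq_len, hidden = 50, 6, 32
decoder = torch.nn.Linear(hidden, vocab)        # stand-in for a NAR decoder head
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

def sentence_reward(sample, reference):
    """Toy sentence-level reward: token overlap (a real setup might use BLEU)."""
    return (sample == reference).float().mean()

states = torch.randn(seq_len, hidden)            # pretend decoder states for one sentence
reference = torch.randint(0, vocab, (seq_len,))  # pretend reference tokens

dist = torch.distributions.Categorical(logits=decoder(states))
sample = dist.sample()                           # self-generated output
loss = -sentence_reward(sample, reference) * dist.log_prob(sample).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```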
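For the n-gram entry above (Revisiting N-Gram Models), here is a minimal sketch of one common way an explicit n-gram language model can be combined with a neural recognizer: shallow fusion during decoding. The bigram table, the fusion weight, and the greedy step are toy assumptions, not the PyLaia or DAN configurations evaluated in the paper.

```python
# Toy shallow-fusion sketch: add a weighted character-bigram log-probability to
# the neural recognizer's per-character score at each decoding step.
import math

BIGRAM = {("t", "h"): 0.4, ("h", "e"): 0.5, ("t", "e"): 0.1}  # assumed pre-estimated LM

def bigram_logp(prev, nxt, floor=1e-4):
    return math.log(BIGRAM.get((prev, nxt), floor))

def fused_greedy_step(neural_logps, prev_char, lm_weight=0.3):
    """Pick the character maximizing neural log-prob + lm_weight * LM log-prob."""
    return max(
        neural_logps,
        key=lambda c: neural_logps[c] + lm_weight * bigram_logp(prev_char, c),
    )

# The neural model slightly prefers 'e', but the LM, which strongly favors
# 'h' after 't', flips the decision.
print(fused_greedy_step({"e": math.log(0.45), "h": math.log(0.40)}, prev_char="t"))  # 'h'
```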
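For the Diformer entry above, a hedged guess at what "three generation directions" can look like at the attention-mask level: left-to-right causal, right-to-left causal, and fully visible (NAR-style) masks. Only the masks are illustrated; Diformer's parameter sharing and training scheme are not reproduced.

```python
# Toy sketch of three generation directions as self-attention masks.
import torch

def direction_mask(seq_len, direction):
    """Boolean self-attention mask; True means the position may be attended to."""
    ones = torch.ones(seq_len, seq_len)
    if direction == "l2r":   # standard autoregressive decoding
        return torch.tril(ones).bool()
    if direction == "r2l":   # autoregressive decoding in reverse order
        return torch.triu(ones).bool()
    if direction == "nar":   # non-autoregressive: every position sees all others
        return ones.bool()
    raise ValueError(direction)

for d in ("l2r", "r2l", "nar"):
    print(d)
    print(direction_mask(4, d).int())
```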
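For the last entry (A Study of Non-autoregressive Model for Sequence Generation), a toy sketch of sequence-level knowledge distillation, the standard recipe of training the NAR model on an AR teacher's outputs instead of the original references. The "teacher" below is a placeholder function, not a real AR NMT model.

```python
# Toy sketch of sequence-level knowledge distillation: replace each reference
# with the AR teacher's translation, which simplifies the target distribution
# that the NAR student has to fit.
def ar_teacher_translate(source: str) -> str:
    # Placeholder for beam-search decoding with a trained AR model.
    return source.upper()

def distill_corpus(parallel_data):
    """Replace each reference with the AR teacher's output for the same source."""
    return [(src, ar_teacher_translate(src)) for src, _ref in parallel_data]

raw = [("guten morgen", "good morning"), ("danke", "thanks")]
print(distill_corpus(raw))  # the NAR model would then be trained on these pairs
```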
This list is automatically generated from the titles and abstracts of the papers in this site.