AMOM: Adaptive Masking over Masking for Conditional Masked Language Model
- URL: http://arxiv.org/abs/2303.07457v1
- Date: Mon, 13 Mar 2023 20:34:56 GMT
- Title: AMOM: Adaptive Masking over Masking for Conditional Masked Language Model
- Authors: Yisheng Xiao, Ruiyang Xu, Lijun Wu, Juntao Li, Tao Qin, Tie-Yan Liu, Min Zhang
- Abstract summary: A conditional masked language model (CMLM) is one of the most versatile frameworks for non-autoregressive sequence generation.
We introduce a simple yet effective adaptive masking over masking strategy to enhance the refinement capability of the decoder.
Our proposed model yields state-of-the-art performance on neural machine translation.
- Score: 81.55294354206923
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer-based autoregressive (AR) methods have achieved appealing
performance for varied sequence-to-sequence generation tasks, e.g., neural
machine translation, summarization, and code generation, but suffer from low
inference efficiency. To speed up the inference stage, many non-autoregressive
(NAR) strategies have been proposed in the past few years. Among them, the
conditional masked language model (CMLM) is one of the most versatile
frameworks, as it can support many different sequence generation scenarios and
achieve very competitive performance on these tasks. In this paper, we further
introduce a simple yet effective adaptive masking over masking strategy to
enhance the refinement capability of the decoder and make the encoder
optimization easier. Experiments on \textbf{3} different tasks (neural machine
translation, summarization, and code generation) with \textbf{15} datasets in
total confirm that our proposed simple method achieves significant performance
improvement over the strong CMLM model. Surprisingly, our proposed model yields
state-of-the-art performance on neural machine translation (\textbf{34.62} BLEU
on WMT16 EN$\to$RO, \textbf{34.82} BLEU on WMT16 RO$\to$EN, and \textbf{34.84}
BLEU on IWSLT De$\to$En) and even better performance than the \textbf{AR}
Transformer on \textbf{7} benchmark datasets with at least \textbf{2.2$\times$}
speedup. Our code is available at GitHub.
Related papers
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of pre-trained large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - ACT-MNMT Auto-Constriction Turning for Multilingual Neural Machine
Translation [38.30649186517611]
This paper introduces an Auto-Constriction Turning mechanism for Multilingual Neural Machine Translation (ACT-MNMT).
arXiv Detail & Related papers (2024-03-11T14:10:57Z) - M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation [45.79215260916687]
We propose M$^{2}$Chat, a novel unified multimodal LLM framework for generating interleaved text-image conversation.
M$^{3}$Adapter integrates granular low-level visual information and high-level semantic features from multi-modality prompts.
The M$^{3}$FT fine-tuning strategy optimizes disjoint groups of parameters for image-text alignment and visual instruction, respectively.
arXiv Detail & Related papers (2023-11-29T11:30:33Z) - TranSFormer: Slow-Fast Transformer for Machine Translation [52.12212173775029]
We present a Slow-Fast two-stream learning model, referred to as TranSFormer.
Our TranSFormer shows consistent BLEU improvements (larger than 1 BLEU point) on several machine translation benchmarks.
arXiv Detail & Related papers (2023-05-26T14:37:38Z) - Extrapolating Multilingual Understanding Models as Multilingual
Generators [82.1355802012414]
This paper explores methods to endow multilingual understanding models with generation abilities, yielding a unified model.
We propose a Semantic-Guided Alignment-then-Denoising (SGA) approach to adapt an encoder into a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z) - Tractable Control for Autoregressive Language Generation [82.79160918147852]
We propose to use tractable probabilistic models (TPMs) to impose lexical constraints in autoregressive text generation models.
We show that GeLaTo achieves state-of-the-art performance on challenging benchmarks for constrained text generation.
Our work opens up new avenues for controlling large language models and also motivates the development of more expressive TPMs.
arXiv Detail & Related papers (2023-04-15T00:19:44Z) - Universal Conditional Masked Language Pre-training for Neural Machine
Translation [29.334361879066602]
We propose CeMAT, a conditional masked language model pre-trained on large-scale bilingual and monolingual corpora.
We conduct extensive experiments and show that our CeMAT can achieve significant performance improvement for all scenarios.
arXiv Detail & Related papers (2022-03-17T10:00:33Z) - MvSR-NAT: Multi-view Subset Regularization for Non-Autoregressive
Machine Translation [0.5586191108738562]
Conditional masked language models (CMLM) have shown impressive progress in non-autoregressive machine translation (NAT).
We introduce Multi-view Subset Regularization (MvSR), a novel regularization method to improve the performance of the NAT model.
We achieve remarkable performance on three public benchmarks with 0.36-1.14 BLEU gains over previous NAT models.
arXiv Detail & Related papers (2021-08-19T02:30:38Z) - Fast Sequence Generation with Multi-Agent Reinforcement Learning [40.75211414663022]
Non-autoregressive decoding has been proposed in machine translation to speed up the inference time by generating all words in parallel.
We propose a simple and efficient model for Non-Autoregressive sequence Generation (NAG) with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
On the MSCOCO image captioning benchmark, our NAG method achieves performance comparable to state-of-the-art autoregressive models, while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2021-01-24T12:16:45Z) - Incorporating BERT into Parallel Sequence Decoding with Adapters [82.65608966202396]
We propose to take two different BERT models as the encoder and decoder respectively, and fine-tune them by introducing simple and lightweight adapter modules.
We obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models.
Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, which suits the bidirectional and conditionally independent nature of BERT (a minimal adapter sketch follows this list).
arXiv Detail & Related papers (2020-10-13T03:25:15Z)
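The adapter approach in the last entry is easy to picture: small bottleneck layers with residual connections are added to the frozen pretrained models, and only those layers are trained. The module below is a generic, hypothetical PyTorch bottleneck adapter in the common down-project/up-project style; the hidden sizes and placement are assumptions and may differ from the cited paper's exact design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, non-linearity, up-project, residual.

    A generic sketch of the adapter pattern; the cited paper's exact layout
    (insertion points and hidden size) may differ.
    """
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen pretrained representation
        # intact when the adapter output is small.
        return x + self.up(self.act(self.down(x)))

# Usage sketch: freeze the pretrained encoder/decoder and train adapters only.
# for p in bert_model.parameters():
#     p.requires_grad = False
# adapter = BottleneckAdapter(hidden_size=768)
```

Freezing the pretrained weights and updating only the adapters is what keeps the approach lightweight, since only a small fraction of the parameters are trained.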