AMOM: Adaptive Masking over Masking for Conditional Masked Language Model
- URL: http://arxiv.org/abs/2303.07457v1
- Date: Mon, 13 Mar 2023 20:34:56 GMT
- Title: AMOM: Adaptive Masking over Masking for Conditional Masked Language Model
- Authors: Yisheng Xiao, Ruiyang Xu, Lijun Wu, Juntao Li, Tao Qin, Tie-Yan Liu, Min Zhang
- Abstract summary: A conditional masked language model (CMLM) is one of the most versatile frameworks for non-autoregressive sequence generation.
We introduce a simple yet effective adaptive masking over masking strategy to enhance the refinement capability of the decoder.
Our proposed model yields state-of-the-art performance on neural machine translation.
- Score: 81.55294354206923
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer-based autoregressive (AR) methods have achieved appealing
performance for varied sequence-to-sequence generation tasks, e.g., neural
machine translation, summarization, and code generation, but suffer from low
inference efficiency. To speed up the inference stage, many non-autoregressive
(NAR) strategies have been proposed in the past few years. Among them, the
conditional masked language model (CMLM) is one of the most versatile
frameworks, as it can support many different sequence generation scenarios and
achieve very competitive performance on these tasks. In this paper, we further
introduce a simple yet effective adaptive masking over masking strategy to
enhance the refinement capability of the decoder and make the encoder
optimization easier. Experiments on \textbf{3} different tasks (neural machine
translation, summarization, and code generation) with \textbf{15} datasets in
total confirm that our proposed simple method achieves significant performance
improvement over the strong CMLM model. Surprisingly, our proposed model yields
state-of-the-art performance on neural machine translation (\textbf{34.62} BLEU
on WMT16 EN$\to$RO, \textbf{34.82} BLEU on WMT16 RO$\to$EN, and \textbf{34.84}
BLEU on IWSLT De$\to$En) and even better performance than the \textbf{AR}
Transformer on \textbf{7} benchmark datasets with at least \textbf{2.2$\times$}
speedup. Our code is available at GitHub.
Related papers
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of pre-trained large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - ACT-MNMT Auto-Constriction Turning for Multilingual Neural Machine
Translation [38.30649186517611]
This paper introduces an Auto-Constriction Turning mechanism for Multilingual Neural Machine Translation (ACT-MNMT).
arXiv Detail & Related papers (2024-03-11T14:10:57Z) - M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation [45.79215260916687]
We propose M$^{2}$Chat, a novel unified multimodal LLM framework for generating interleaved text-image conversation.
M$^{3}$Adapter integrates granular low-level visual information and high-level semantic features from multi-modality prompts.
The M$^{3}$FT fine-tuning strategy optimizes disjoint groups of parameters for image-text alignment and visual instruction, respectively.
arXiv Detail & Related papers (2023-11-29T11:30:33Z) - TranSFormer: Slow-Fast Transformer for Machine Translation [52.12212173775029]
We present a Slow-Fast two-stream learning model, referred to as TranSFormer.
Our TranSFormer shows consistent BLEU improvements (larger than 1 BLEU point) on several machine translation benchmarks.
arXiv Detail & Related papers (2023-05-26T14:37:38Z) - Extrapolating Multilingual Understanding Models as Multilingual
Generators [82.1355802012414]
This paper explores methods to endow multilingual understanding models with generation abilities, yielding a unified model.
We propose a Semantic-Guided Alignment-then-Denoising (SGA) approach to adapt an encoder into a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z) - Tractable Control for Autoregressive Language Generation [82.79160918147852]
We propose to use tractable probabilistic models (TPMs) to impose lexical constraints in autoregressive text generation models.
We show that GeLaTo achieves state-of-the-art performance on challenging benchmarks for constrained text generation.
Our work opens up new avenues for controlling large language models and also motivates the development of more expressive TPMs.
arXiv Detail & Related papers (2023-04-15T00:19:44Z) - Universal Conditional Masked Language Pre-training for Neural Machine
Translation [29.334361879066602]
We propose CeMAT, a conditional masked language model pre-trained on large-scale bilingual and monolingual corpora.
We conduct extensive experiments and show that our CeMAT can achieve significant performance improvement for all scenarios.
arXiv Detail & Related papers (2022-03-17T10:00:33Z) - MvSR-NAT: Multi-view Subset Regularization for Non-Autoregressive
Machine Translation [0.5586191108738562]
Conditional masked language models (CMLM) have shown impressive progress in non-autoregressive machine translation (NAT).
We introduce Multi-view Subset Regularization (MvSR), a novel regularization method to improve the performance of the NAT model.
We achieve remarkable performance on three public benchmarks with 0.36-1.14 BLEU gains over previous NAT models.
arXiv Detail & Related papers (2021-08-19T02:30:38Z) - Fast Sequence Generation with Multi-Agent Reinforcement Learning [40.75211414663022]
Non-autoregressive decoding has been proposed in machine translation to speed up the inference time by generating all words in parallel.
We propose a simple and efficient model for Non-Autoregressive sequence Generation (NAG) with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
On the MSCOCO image captioning benchmark, our NAG method achieves performance comparable to state-of-the-art autoregressive models, while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2021-01-24T12:16:45Z) - Incorporating BERT into Parallel Sequence Decoding with Adapters [82.65608966202396]
We propose to take two different BERT models as the encoder and decoder respectively, and fine-tune them by introducing simple and lightweight adapter modules.
We obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models.
Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, which suits the bidirectional and conditionally independent nature of BERT (a minimal adapter sketch follows this list).
arXiv Detail & Related papers (2020-10-13T03:25:15Z)
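The adapter approach in the last entry is easy to picture: small bottleneck layers with residual connections are added to the frozen pretrained models, and only those layers are trained. The module below is a generic, hypothetical PyTorch bottleneck adapter in the common down-project/up-project style; the hidden sizes and placement are assumptions and may differ from the cited paper's exact design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, non-linearity, up-project, residual.

    A generic sketch of the adapter pattern; the cited paper's exact layout
    (insertion points and hidden size) may differ.
    """
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen pretrained representation
        # intact when the adapter output is small.
        return x + self.up(self.act(self.down(x)))

# Usage sketch: freeze the pretrained encoder/decoder and train adapters only.
# for p in bert_model.parameters():
#     p.requires_grad = False
# adapter = BottleneckAdapter(hidden_size=768)
```

Freezing the pretrained weights and updating only the adapters is what keeps the approach lightweight, since only a small fraction of the parameters are trained.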