Related papers: Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning

Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning

URL: http://arxiv.org/abs/2411.03042v1
Date: Tue, 05 Nov 2024 12:26:25 GMT
Title: Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning
Authors: Bei Li, Tong Zheng, Rui Wang, Jiahao Liu, Qingyan Guo, Junliang Guo, Xu Tan, Tong Xiao, Jingbo Zhu, Jingang Wang, Xunliang Cai,
Abstract summary: We introduce a predictor-corrector learning framework to minimize truncation errors. We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters.
Score: 73.73967342609603
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Residual networks, as discrete approximations of Ordinary Differential Equations (ODEs), have inspired significant advancements in neural network design, including multistep methods, high-order methods, and multi-particle dynamical systems. The precision of the solution to ODEs significantly affects parameter optimization, thereby impacting model performance. In this work, we present a series of advanced explorations of Transformer architecture design to minimize the error compared to the true ``solution.'' First, we introduce a predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector. Second, we propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Extensive experiments on large-scale machine translation, abstractive summarization, language modeling, and natural language understanding benchmarks demonstrate the superiority of our approach. On the WMT'14 English-German and English-French tasks, our model achieved BLEU scores of 30.95 and 44.27, respectively. Furthermore, on the OPUS multilingual machine translation task, our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters. Notably, it also beats LLama models by 5.7 accuracy points on the LM Harness Evaluation.

Related papers

AMSFL: Adaptive Multi-Step Federated Learning via Gradient Difference-Based Error Modeling [0.0]
Federated learning faces critical challenges in balancing communication efficiency and model accuracy.<n>One key issue lies in the approximation of update errors without incurring high computational costs.<n>We propose a lightweight yet effective method called Gradient Difference Approximation (GDA)
arXiv Detail & Related papers (2025-05-27T19:32:00Z)
A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning. These problems are often formalized as Bi-Level optimizations (BLO) We introduce a novel perspective by turning a given BLO problem into a ii optimization, where the inner loss function becomes a smooth distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z)
Human Evaluation of English--Irish Transformer-Based NMT [2.648836772989769]
Best-performing Transformer system significantly reduces both accuracy and errors when compared with an RNN-based model. When benchmarked against Google Translate, our translation engines demonstrated significant improvements.
arXiv Detail & Related papers (2024-03-04T11:45:46Z)
DF2: Distribution-Free Decision-Focused Learning [53.2476224456902]
Decision-focused learning (DFL) has recently emerged as a powerful approach for predictthen-optimize problems. Existing end-to-end DFL methods are hindered by three significant bottlenecks: model error, sample average approximation error, and distribution-based parameterization of the expected objective. We present DF2 -- the first textit-free decision-focused learning method explicitly designed to address these three bottlenecks.
arXiv Detail & Related papers (2023-08-11T00:44:46Z)
INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation [5.837035655563323]
We introduce a method that dramatically reduces fine-tuning VRAM requirements and rectifies quantization errors in quantized Large Language Models. Our method reduces the memory requirements by up to 5.6 times, which enables fine-tuning a 7 billion parameter Large Language Model (LLM) on consumer laptops.
arXiv Detail & Related papers (2023-06-13T22:25:35Z)
Non-parametric, Nearest-neighbor-assisted Fine-tuning for Neural Machine Translation [22.59222643493867]
Non-parametric, k-nearest-neighbor algorithms have recently made inroads to assist generative models such as language models and machine translation decoders. We explore whether such non-parametric models can improve machine translation models at the fine-tuning stage by incorporating statistics from the kNN predictions.
arXiv Detail & Related papers (2023-05-23T03:44:06Z)
Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep. We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs) We present Efficient Ensemble of Experts (E$3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost. We explore to accelerate large-model inference by conditional computation based on the sparse activation phenomenon. We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
Scaling Laws for Neural Machine Translation [21.76567580425173]
We show that cross-entropy loss as a function of model size follows a certain scaling law. We also investigate the relationship between the cross-entropy loss and the quality of the translations generated.
arXiv Detail & Related papers (2021-09-16T06:15:20Z)
Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for Multilingual neural machine translation (MNMT) based on distributionally robust optimization. We show how to practically optimize this objective for large translation corpora using an iterated best response scheme. Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.