Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models
- URL: http://arxiv.org/abs/2306.08756v1
- Date: Wed, 14 Jun 2023 21:41:52 GMT
- Title: Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models
- Authors: Saleh Soltan, Andy Rosenbaum, Tobias Falke, Qin Lu, Anna Rumshisky, Wael Hamza
- Abstract summary: We explore recipes to improve training efficiency by initializing one model from the other.
Using an encoder to warm-start seq2seq training, we show that we can match task performance of a from-scratch seq2seq model.
- Score: 16.49601740473416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained encoder-only and sequence-to-sequence (seq2seq) models each have
advantages; however, training both model types from scratch is computationally
expensive. We explore recipes to improve pre-training efficiency by
initializing one model from the other. (1) Extracting the encoder from a
seq2seq model, we show it under-performs a Masked Language Modeling (MLM)
encoder, particularly on sequence labeling tasks. Variations of masking during
seq2seq training, reducing the decoder size, and continuing with a small amount
of MLM training do not close the gap. (2) Conversely, using an encoder to
warm-start seq2seq training, we show that by unfreezing the encoder partway
through training, we can match task performance of a from-scratch seq2seq
model. Overall, this two-stage approach is an efficient recipe to obtain both a
multilingual encoder and a seq2seq model, matching the performance of training
each model from scratch while reducing the total compute cost by 27%.
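To make recipe (2) concrete, the sketch below shows one way to warm-start a seq2seq model from a pre-trained MLM encoder in plain PyTorch, keeping the encoder frozen at first and unfreezing it partway through training. This is a minimal illustration, not the authors' code; pretrained_mlm_encoder, decoder, train_loader, loss_fn, and unfreeze_step are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class WarmStartSeq2Seq(nn.Module):
    """Seq2seq model whose encoder is initialized from a pre-trained MLM encoder."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # weights copied from the pre-trained MLM encoder
        self.decoder = decoder   # randomly initialized decoder

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        memory = self.encoder(src_ids)        # contextual source representations
        return self.decoder(tgt_ids, memory)  # decoder cross-attends to encoder outputs


def set_encoder_trainable(model: WarmStartSeq2Seq, trainable: bool) -> None:
    """Freeze or unfreeze the encoder parameters."""
    for param in model.encoder.parameters():
        param.requires_grad = trainable


# pretrained_mlm_encoder, decoder, train_loader, loss_fn, and unfreeze_step are
# hypothetical placeholders assumed to be defined elsewhere.
model = WarmStartSeq2Seq(pretrained_mlm_encoder, decoder)
set_encoder_trainable(model, False)  # stage 1: train only the (random) decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step, (src_ids, tgt_ids, labels) in enumerate(train_loader):
    if step == unfreeze_step:
        set_encoder_trainable(model, True)  # stage 2: unfreeze the encoder partway through
    logits = model(src_ids, tgt_ids)
    loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Freezing the encoder first lets the randomly initialized decoder adapt to the encoder's representations; unfreezing it partway through training is what the abstract reports as necessary to match the task performance of a from-scratch seq2seq model.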
Related papers
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible across architecture configurations and can be extended beyond segmentation to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
- Exploiting the Potential of Seq2Seq Models as Robust Few-Shot Learners [8.43854206194162]
We show that seq2seq models can be highly effective few-shot learners for a wide spectrum of applications.
We propose two methods to more effectively elicit in-context learning ability in seq2seq models: objective-aligned prompting and a fusion-based approach.
arXiv Detail & Related papers (2023-07-27T13:37:06Z)
- A Framework for Bidirectional Decoding: Case Study in Morphological Inflection [4.602447284133507]
We propose a framework for decoding sequences from the "outside-in".
At each step, the model chooses to generate a token on the left, on the right, or join the left and right sequences.
Our model sets state-of-the-art (SOTA) on the 2022 and 2023 shared tasks, beating the next best systems by over 4.7 and 2.7 points in average accuracy respectively.
arXiv Detail & Related papers (2023-05-21T22:08:31Z)
- TimeMAE: Self-Supervised Representations of Time Series with Decoupled Masked Autoencoders [55.00904795497786]
We propose TimeMAE, a novel self-supervised paradigm for learning transferrable time series representations based on transformer networks.
TimeMAE learns enriched contextual representations of time series with a bidirectional encoding scheme.
To solve the discrepancy issue incurred by newly injected masked embeddings, we design a decoupled autoencoder architecture.
arXiv Detail & Related papers (2023-03-01T08:33:16Z)
- Hierarchical Phrase-based Sequence-to-Sequence Learning [94.10257313923478]
We describe a neural transducer that maintains the flexibility of standard sequence-to-sequence (seq2seq) models while incorporating hierarchical phrases as a source of inductive bias during training and as explicit constraints during inference.
Our approach trains two models: a discriminative parser based on a bracketing grammar whose derivation trees hierarchically align source and target phrases, and a neural seq2seq model that learns to translate the aligned phrases one by one.
arXiv Detail & Related papers (2022-11-15T05:22:40Z)
- Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization [60.91600465922932]
We present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder.
Our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods.
arXiv Detail & Related papers (2022-10-23T00:32:04Z)
- E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation [95.49128988683191]
Sequence-to-sequence (seq2seq) learning is a popular approach for pretraining large-scale language models.
We propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2.
E2S2 improves seq2seq models by integrating more efficient self-supervised information into the encoder.
arXiv Detail & Related papers (2022-05-30T08:25:36Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs (an asymmetric encoder-decoder architecture and a high masking ratio) enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Efficient Decoding of Surface Code Syndromes for Error Correction in Quantum Computing [0.09236074230806578]
We propose a two-level (low and high) ML-based decoding scheme, where the first level corrects errors on physical qubits and the second one corrects any existing logical errors.
Our results show that our proposed decoding method achieves $\sim 10\times$ and $\sim 2\times$ higher values of pseudo-threshold and threshold, respectively.
We show that using more sophisticated ML models with higher training/testing time does not provide significant improvement in decoder performance.
arXiv Detail & Related papers (2021-10-21T04:54:44Z)
- Tiny Neural Models for Seq2Seq [0.0]
We propose a projection-based encoder-decoder model referred to as pQRNN-MAtt.
The resulting quantized models are less than 3.5MB in size and are well suited for on-device latency-critical applications.
We show that on MTOP, a challenging multilingual semantic parsing dataset, the average model performance surpasses an LSTM-based seq2seq model that uses pre-trained embeddings, despite being 85x smaller.
arXiv Detail & Related papers (2021-08-07T00:39:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.