Seq vs Seq: An Open Suite of Paired Encoders and Decoders
- URL: http://arxiv.org/abs/2507.11412v1
- Date: Tue, 15 Jul 2025 15:31:51 GMT
- Title: Seq vs Seq: An Open Suite of Paired Encoders and Decoders
- Authors: Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme
- Abstract summary: We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes. We show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective.
- Score: 37.62535961965971
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.
Related papers
- Are Decoder-Only Large Language Models the Silver Bullet for Code Search? [32.338318300589776]
This study presents the first systematic exploration of decoder-only large language models for code search.
We evaluate nine state-of-the-art decoder-only models using two fine-tuning methods, two datasets, and three model sizes.
Our findings reveal that fine-tuned CodeGemma significantly outperforms encoder-only models like UniXcoder.
arXiv Detail & Related papers (2024-10-29T17:05:25Z)
- You Only Cache Once: Decoder-Decoder Architectures for Language Models [132.4064488592704]
We introduce a decoder-decoder architecture, YOCO, for large language models.
YOCO only caches key-value pairs once.
The overall model behaves like a decoder-only Transformer, although YOCO only caches once.
arXiv Detail & Related papers (2024-05-08T17:57:39Z)
- Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models [16.49601740473416]
We explore recipes to improve training efficiency by initializing one model from the other.
Using an encoder to warm-start seq2seq training, we show that we can match task performance of a from-scratch seq2seq model.
arXiv Detail & Related papers (2023-06-14T21:41:52Z)
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
- Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained Experts [2.457872341625575]
We present Video Pre-trained Transformer.
It uses four SOTA encoder models to convert a video into a sequence of compact embeddings.
It learns using an autoregressive causal language modeling loss by predicting the words spoken in YouTube videos.
arXiv Detail & Related papers (2023-03-24T17:18:40Z)
- ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model on document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z)
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming languages.
We propose a one-to-one mapping method to transform an AST into a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.