Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
- URL: http://arxiv.org/abs/2510.26622v1
- Date: Thu, 30 Oct 2025 15:48:28 GMT
- Title: Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
- Authors: Biao Zhang, Yong Cheng, Siamak Shakeri, Xinyi Wang, Min Ma, Orhan Firat,
- Abstract summary: We revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance.
- Score: 30.945523139748634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to the now-dominant decoder-only modeling. This rapid transition, however, comes without a rigorous comparative analysis, especially from the scaling perspective, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales, ranging from ~150M to ~8B. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable and even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings could inspire more efforts on re-examining RedLLM, unlocking its potential for developing powerful and efficient LLMs.
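To make the two pretraining objectives concrete, here is a minimal illustrative sketch (not taken from the paper) of the attention masks behind causal LM and prefix LM. In an encoder-decoder model such as RedLLM, the bidirectional prefix is handled by the encoder and the causal continuation by the decoder; the single-stack masks below capture the same objective-level difference.

```python
# Illustrative only: boolean attention masks for the two pretraining objectives.
import torch

def causal_lm_mask(seq_len: int) -> torch.Tensor:
    # Causal LM (DecLLM): position i may attend only to positions <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_lm_mask(prefix_len: int, seq_len: int) -> torch.Tensor:
    # Prefix LM (RedLLM-style objective): the prefix attends to itself
    # bidirectionally; the continuation attends causally to all earlier tokens.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix
    return mask

print(causal_lm_mask(5).int())
print(prefix_lm_mask(prefix_len=2, seq_len=5).int())
```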
Related papers
- Language Ranker: A Lightweight Ranking framework for LLM Decoding [70.01564145836129]
This paper conceptualizes the decoding process as analogous to the ranking stage in recommendation pipelines. Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses. Experiments show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only 0.5M additional parameters.
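A toy sketch of the general idea, with module sizes and names of my own choosing (not the paper's implementation): sample several candidate responses, embed them, and let a small scoring head pick the best one.

```python
# Toy reranking sketch: score candidate-response embeddings with a tiny module.
import torch
import torch.nn as nn

class TinyRanker(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        # A small MLP scoring head; full-scale reward models are far larger.
        self.score = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, candidate_embeddings: torch.Tensor) -> torch.Tensor:
        # candidate_embeddings: (num_candidates, hidden) -> one score per candidate
        return self.score(candidate_embeddings).squeeze(-1)

ranker = TinyRanker()
candidates = torch.randn(8, 64)              # stand-in embeddings of 8 sampled responses
best_idx = int(ranker(candidates).argmax())  # index of the highest-scoring response
```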
arXiv Detail & Related papers (2025-10-23T17:56:46Z) - Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs [63.82840470917859]
We show that the decoding mechanism of dLLMs can be used as a powerful tool for model attribution. We propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors.
arXiv Detail & Related papers (2025-10-02T06:25:10Z) - Should We Still Pretrain Encoders with Masked Language Modeling? [27.19054714197245]
Recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders. We train a total of 38 models ranging from 210 million to 1 billion parameters, and conduct over 15,000 fine-tuning and evaluation runs. We find that while masked language modeling (MLM) generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability.
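For context, a minimal sketch (illustrative, not the paper's code) of how the training targets differ between the two objectives being compared:

```python
# Illustrative contrast between causal LM and masked LM training targets.
import torch

def clm_targets(input_ids: torch.Tensor):
    # Causal LM: predict token t from tokens < t (shift-by-one targets).
    return input_ids[:, :-1], input_ids[:, 1:]

def mlm_targets(input_ids: torch.Tensor, mask_prob: float = 0.15, mask_id: int = 103):
    # Masked LM: replace ~15% of tokens with [MASK] and predict only those,
    # using bidirectional context over the corrupted sequence.
    is_masked = torch.rand(input_ids.shape) < mask_prob
    inputs = torch.where(is_masked, torch.full_like(input_ids, mask_id), input_ids)
    labels = torch.where(is_masked, input_ids, torch.full_like(input_ids, -100))  # -100 = ignore
    return inputs, labels
```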
arXiv Detail & Related papers (2025-07-01T17:45:48Z) - DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation [68.19756761027351]
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models. We investigate their denoising processes and reinforcement learning methods. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework.
arXiv Detail & Related papers (2025-06-25T17:35:47Z) - Zero-Shot Vision Encoder Grafting via LLM Surrogates [65.37227522413689]
Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM). We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM. Vision encoders trained on the surrogate can then be directly transferred to the larger model.
arXiv Detail & Related papers (2025-05-28T17:59:59Z) - Leveraging Decoder Architectures for Learned Sparse Retrieval [26.483483554222012]
Learned Sparse Retrieval (LSR) has traditionally focused on small-scale encoder-only transformer architectures. This study investigates the effectiveness of LSR across different transformer-based architectures.
arXiv Detail & Related papers (2025-04-25T08:04:52Z) - Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation [52.19855651708349]
We study a novel problem: adapting decoder-only large language models to encoder-decoder models. We argue that adaptation not only enables inheriting the capability of decoder-only LLMs but also reduces the demand for computation. Under a similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance but substantially better finetuning performance than their decoder-only counterparts.
arXiv Detail & Related papers (2025-04-08T17:13:41Z) - Are Decoder-Only Large Language Models the Silver Bullet for Code Search? [44.9422305001193]
Code search is essential for code reuse, allowing developers to efficiently locate relevant code snippets. Powerful decoder-only Large Language Models (LLMs) have revolutionized many code intelligence tasks. This paper presents a systematic evaluation of eleven decoder-only LLMs, analyzing their performance across zero-shot and fine-tuned settings.
arXiv Detail & Related papers (2024-10-29T17:05:25Z) - Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding [15.723047976314751]
Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following.
We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding.
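One plausible realization of such a hybrid, sketched with toy module names and dimensions of my own (the real system is more involved): the large model encodes the prompt in a single pass, and a small model performs all autoregressive decoding conditioned on the projected prompt states.

```python
# Toy sketch: a large (frozen) model encodes the prompt once; a small model
# decodes autoregressively, conditioned on projected prompt representations.
import torch
import torch.nn as nn

class LLMtoSLMSketch(nn.Module):
    def __init__(self, vocab: int = 1000, big_dim: int = 256, small_dim: int = 64):
        super().__init__()
        self.big_emb = nn.Embedding(vocab, big_dim)
        self.big_encoder = nn.TransformerEncoder(       # stands in for the frozen LLM
            nn.TransformerEncoderLayer(big_dim, nhead=8, batch_first=True), num_layers=2)
        self.projector = nn.Linear(big_dim, small_dim)  # maps LLM states into the SLM space
        self.small_emb = nn.Embedding(vocab, small_dim)
        self.small_decoder = nn.TransformerDecoder(     # stands in for the fast SLM
            nn.TransformerDecoderLayer(small_dim, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(small_dim, vocab)

    @torch.no_grad()
    def generate(self, prompt_ids: torch.Tensor, max_new_tokens: int = 8) -> torch.Tensor:
        # One expensive pass of the big model, over the prompt only.
        memory = self.projector(self.big_encoder(self.big_emb(prompt_ids)))
        out = prompt_ids[:, -1:]                        # seed with the last prompt token
        for _ in range(max_new_tokens):                 # cheap small-model decoding loop
            h = self.small_decoder(self.small_emb(out), memory)
            next_tok = self.lm_head(h[:, -1]).argmax(-1, keepdim=True)
            out = torch.cat([out, next_tok], dim=-1)
        return out

model = LLMtoSLMSketch().eval()
print(model.generate(torch.randint(0, 1000, (1, 6))))
```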
arXiv Detail & Related papers (2024-02-26T18:59:28Z) - Machine Learning-Aided Efficient Decoding of Reed-Muller Subcodes [59.55193427277134]
Reed-Muller (RM) codes achieve the capacity of general binary-input memoryless symmetric channels.
RM codes only admit limited sets of rates.
Efficient decoders are available for RM codes at finite lengths.
arXiv Detail & Related papers (2023-01-16T04:11:14Z) - ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model on document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference (see the sketch below).
arXiv Detail & Related papers (2022-04-25T06:26:29Z)
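To illustrate what such a decomposition buys at inference time, here is a hedged sketch using a generic T5 checkpoint from Hugging Face transformers; the "t5-small" model and the document-to-query fine-tuning step are stand-ins, not the paper's released artifacts. Document encoder states are precomputed offline, so scoring a query against a document only requires running the decoder.

```python
# Hedged sketch: precompute a document's encoder states once, then rank queries
# by how likely the decoder finds them given those cached states.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

def precompute_doc_states(document: str):
    # Offline step: one encoder pass per document, cached for later queries.
    enc = tok(document, return_tensors="pt")
    return model.encoder(**enc), enc["attention_mask"]

@torch.no_grad()
def query_score(query: str, encoder_outputs, doc_mask) -> float:
    # Online step: only the decoder runs, acting as an LM over the query tokens.
    labels = tok(query, return_tensors="pt").input_ids
    out = model(encoder_outputs=encoder_outputs, attention_mask=doc_mask, labels=labels)
    return -out.loss.item()  # average per-token log-likelihood; higher = better match

doc_states, doc_mask = precompute_doc_states("T5 is an encoder-decoder transformer model.")
print(query_score("what is T5", doc_states, doc_mask))
```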
This list is automatically generated from the titles and abstracts of the papers on this site.