Related papers: Return of the Encoder: Maximizing Parameter Efficiency for SLMs

Return of the Encoder: Maximizing Parameter Efficiency for SLMs

URL: http://arxiv.org/abs/2501.16273v2
Date: Thu, 30 Jan 2025 16:44:45 GMT
Title: Return of the Encoder: Maximizing Parameter Efficiency for SLMs
Authors: Mohamed Elfeki, Rui Liu, Chad Voegele,
Abstract summary: encoder-decoder architectures achieve 47% lower first-token latency and 4.7x higher throughput compared to decoder-only models on edge devices.<n>We introduce a novel knowledge distillation framework that enables encoder-decoder models to leverage capabilities from large scalable decoder-only teachers.
Score: 4.246337121596753
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters or fewer - our systematic analysis across GPU, CPU, and NPU platforms reveals that encoder-decoder architectures achieve 47% lower first-token latency and 4.7x higher throughput compared to decoder-only models on edge devices. These gains may be attributed to encoder-decoder's one-time input processing and efficient separation of understanding and generation phases. We introduce a novel knowledge distillation framework that enables encoder-decoder models to leverage capabilities from large scalable decoder-only teachers while preserving their architectural advantages, achieving up to 6 average performance points improvement across diverse tasks, with significant gains in asymmetric sequence tasks where input and output distributions can benefit from different processing approaches. When combined with modern advances like Rotary Positional Embeddings (RoPE) and Vision encoders, our systematic investigation demonstrates that encoder-decoder architectures provide a more practical path toward deploying capable language models in resource-constrained environments. Our findings challenge the prevailing trend toward decoder-only scaling, showing that architectural choices become increasingly crucial as parameter budgets decrease, particularly for on-device and edge deployments where computational efficiency is paramount.

Related papers

Leveraging Decoder Architectures for Learned Sparse Retrieval [26.483483554222012]
Learned Sparse Retrieval (LSR) has traditionally focused on small-scale encoder-only transformer architectures. This study investigates the effectiveness of LSR across different transformer-based architectures.
arXiv Detail & Related papers (2025-04-25T08:04:52Z)
Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks [24.674661807982865]
We introduce Gemma, adapting the powerful decoder model to an encoder architecture. To optimize the adaptation from decoder to encoder, we analyze various pooling strategies. We benchmark Gemma against established approaches on the GLUE benchmarks, and MS MARCO ranking benchmark.
arXiv Detail & Related papers (2025-03-04T14:17:00Z)
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models [72.07868838411474]
Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. We show that properly and hierarchically associating vision and language within a unified model reduces interference between modalities.
arXiv Detail & Related papers (2025-02-10T18:59:58Z)
Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks [53.550782959908524]
We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks. Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
arXiv Detail & Related papers (2024-03-19T19:27:23Z)
Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames. We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models [29.468888611690346]
We propose a simple and effective framework, NASH, that narrows the encoder and shortens the decoder networks of encoder-decoder models. Our findings highlight two insights: (1) the number of decoder layers is the dominant factor of inference speed, and (2) low sparsity in the pruned encoder network enhances generation quality.
arXiv Detail & Related papers (2023-10-16T04:27:36Z)
Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims at generating the target sequence based on the given input source sequence. Traditionally, most of the seq2seq task is resolved by an encoder to encode the source sequence and a decoder to generate the target text. Recently, a bunch of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models. We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words. Our framework is game-theoretic, motivated by generative adversarial networks (GANs)
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation [98.05643473345474]
We propose a novel decoder, termed dynamic neural representational decoder (NRD) As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks. This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.