Towards smaller, faster decoder-only transformers: Architectural variants and their implications
- URL: http://arxiv.org/abs/2404.14462v4
- Date: Tue, 08 Oct 2024 09:20:56 GMT
- Title: Towards smaller, faster decoder-only transformers: Architectural variants and their implications
- Authors: Sathya Krishnan Suresh, Shunmugapriya P
- Abstract summary: We introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT, LinearGPT, and ConvGPT.
These variants demonstrate comparable performance to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training processes.
- Score: 0.0
- Abstract: In recent times, the research on Large Language Models (LLMs) has grown exponentially, predominantly focusing on models underpinned by the transformer architecture, as established by [1], and further developed through the decoder-only variations by [2]. Contemporary efforts in this field primarily aim to enhance model capabilities by scaling up both the architecture and the data volumes used during training. However, the exploration into reducing these model sizes while preserving their efficacy remains scant. In this study, we introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT (pgpt), LinearGPT (lgpt), and ConvGPT (cgpt). These variants demonstrate comparable performance to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training. We open-source the model weights and the complete codebase for these implementations to support further research.
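The abstract does not detail how the three variants differ from a standard decoder block. As a point of reference only, the sketch below shows one plausible reading of a "parallel" block (attention and MLP branches computed from the same normalized input and summed), in the spirit of the ParallelGPT name; the names, hyperparameters, and design here are assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a "parallel" decoder block; the paper's actual
# ParallelGPT design may differ. Attention and MLP branches are computed
# from the same normalized input and summed, rather than applied sequentially.
import torch
import torch.nn as nn


class ParallelDecoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position attends only to itself and earlier positions.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1
        )
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        mlp_out = self.mlp(h)
        # One residual connection around the summed parallel branches.
        return x + self.dropout(attn_out + mlp_out)
```

Parallel blocks of this kind appear elsewhere (e.g., GPT-J-style layers), mainly because the two branches can be fused for better hardware utilization; whether the paper's variant follows this exact scheme is not stated in the abstract.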
Related papers
- Towards Neural Scaling Laws for Time Series Foundation Models [63.5211738245487]
We examine two common TSFM architectures, encoder-only and decoder-only Transformers, and investigate their scaling behavior on both ID and OOD data.
Our experiments reveal that the log-likelihood loss of TSFMs exhibits similar scaling behavior in both OOD and ID settings.
We provide practical guidelines for designing and scaling larger TSFMs with enhanced model capabilities.
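The summary does not state the fitted functional form. A common ansatz in scaling-law studies, offered here only as an illustrative assumption of what "similar scaling behavior" typically means, is a power law in the scaled quantity (data, parameters, or compute):

```latex
% Illustrative power-law ansatz; L_{\infty}, a, and \alpha are placeholders,
% not values reported by the paper.
L(D) = L_{\infty} + a \, D^{-\alpha}
```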
arXiv Detail & Related papers (2024-10-16T08:23:39Z) - Research on Personalized Compression Algorithm for Pre-trained Models Based on Homomorphic Entropy Increase [2.6513322539118582]
We explore the challenges and evolution of two key technologies in the current field of AI: the Vision Transformer model and the Large Language Model (LLM).
The Vision Transformer captures global information by splitting images into small patches, but its high parameter count and compute overhead limit deployment on mobile devices.
LLM has revolutionized natural language processing, but it also faces huge deployment challenges.
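The "splitting images into small pieces" step mentioned above is the standard ViT patch embedding. A minimal sketch follows; the hyperparameters are illustrative defaults, not values from the paper:

```python
# Minimal ViT-style patch embedding: split an image into non-overlapping
# patches and project each patch to a token embedding. Sizes are illustrative.
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, img_size: int = 224, patch_size: int = 16,
                 in_channels: int = 3, d_model: int = 768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, d_model, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, channels, height, width)
        x = self.proj(images)                # (batch, d_model, h/p, w/p)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, d_model)
```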
arXiv Detail & Related papers (2024-08-16T11:56:49Z) - State Space Model for New-Generation Network Alternative to Transformers: A Survey [52.812260379420394]
In the post-deep learning era, the Transformer architecture has demonstrated its powerful performance across pre-trained big models and various downstream tasks.
To further reduce the complexity of attention models, numerous efforts have been made to design more efficient methods.
Among them, the State Space Model (SSM), as a possible replacement for the self-attention based Transformer model, has drawn more and more attention in recent years.
arXiv Detail & Related papers (2024-04-15T07:24:45Z) - Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers [1.1499643186017316]
We propose Cross-Architecture Transfer Learning (XATL) to improve the efficiency of Transformer Language Models.
XATL reduces training time by up to 2.5x and converges to a better minimum, yielding up to 2.6% stronger models on the LM benchmarks within the same compute budget.
arXiv Detail & Related papers (2024-04-03T12:27:36Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
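A full MPO decomposition factorizes a weight matrix into a chain of local tensors; the sketch below collapses that idea to a single shared central factor with small layer-specific factors, purely to illustrate the parameter-sharing pattern described above. Shapes and names are assumptions, not the paper's implementation.

```python
# Simplified sketch of sharing a "central" factor across layers:
# each layer reconstructs its weight as W_l ~= A_l @ C @ B_l, where C is
# shared by all layers and only A_l, B_l are layer-specific.
import torch
import torch.nn as nn


class SharedCentralLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int, central: nn.Parameter):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.02)  # layer-specific
        self.central = central                                   # shared across layers
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.02)    # layer-specific

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.A @ self.central @ self.B  # reconstructed (d_out, d_in) weight
        return x @ weight.t()


# One shared central tensor keeps the per-layer parameter count small.
central = nn.Parameter(torch.randn(64, 64) * 0.02)
layers = nn.ModuleList([SharedCentralLinear(512, 512, 64, central) for _ in range(12)])
```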
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning a linear map from the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
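As a rough illustration of "linearly mapping the parameters of the smaller model," the sketch below expands a small weight matrix into a wider one via trainable expansion matrices. LiGO itself factorizes growth into separate width and depth operators trained from the small checkpoint; the shapes and names here are illustrative assumptions.

```python
# Hedged sketch of a learned width-growth operator: a small model's weight
# matrix is linearly expanded into an initialization for a wider model.
import torch
import torch.nn as nn


class WidthGrowthOperator(nn.Module):
    def __init__(self, d_small: int, d_large: int):
        super().__init__()
        # Trainable linear maps from the small width to the large width.
        self.row_expand = nn.Parameter(torch.randn(d_large, d_small) * 0.02)
        self.col_expand = nn.Parameter(torch.randn(d_large, d_small) * 0.02)

    def forward(self, w_small: torch.Tensor) -> torch.Tensor:
        # (d_small, d_small) -> (d_large, d_large)
        return self.row_expand @ w_small @ self.col_expand.t()


grow = WidthGrowthOperator(d_small=256, d_large=512)
w_large_init = grow(torch.randn(256, 256))  # initializes one layer of the larger model
```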
arXiv Detail & Related papers (2023-03-02T05:21:18Z) - Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z) - N-Grammer: Augmenting Transformers with latent n-grams [35.39961549040385]
We propose a simple yet effective modification to the Transformer architecture inspired by the literature in statistical language modeling, by augmenting the model with n-grams that are constructed from a discrete latent representation of the text sequence.
We evaluate our model, the N-Grammer, on language modeling on the C4 dataset as well as text classification on the SuperGLUE dataset, and find that it outperforms several strong baselines such as the Transformer and the Primer.
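The sketch below illustrates the general idea of augmenting token representations with embeddings of latent n-grams. N-Grammer itself derives discrete latent IDs via product quantization per head and combines the n-gram embeddings with the normalized token embeddings; here the cluster IDs are assumed to be given and bigrams are added directly, so treat this as a loose, hypothetical approximation rather than the paper's method.

```python
# Hedged sketch: augment token representations with latent bigram embeddings.
# cluster_ids are assumed precomputed discrete latent codes (dtype long).
import torch
import torch.nn as nn


class LatentBigramAugment(nn.Module):
    def __init__(self, num_clusters: int, ngram_vocab: int, d_ngram: int, d_model: int):
        super().__init__()
        self.num_clusters = num_clusters
        self.ngram_vocab = ngram_vocab
        self.ngram_emb = nn.Embedding(ngram_vocab, d_ngram)
        self.proj = nn.Linear(d_ngram, d_model)

    def forward(self, token_repr: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
        # token_repr: (batch, seq, d_model); cluster_ids: (batch, seq)
        prev = torch.roll(cluster_ids, shifts=1, dims=1)
        prev[:, 0] = 0  # no left context for the first position
        bigram_ids = (cluster_ids * self.num_clusters + prev) % self.ngram_vocab
        return token_repr + self.proj(self.ngram_emb(bigram_ids))
```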
arXiv Detail & Related papers (2022-07-13T17:18:02Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z) - E.T.: Entity-Transformers. Coreference augmented Neural Language Model for richer mention representations via Entity-Transformer blocks [3.42658286826597]
We present an extension over the Transformer-block architecture used in neural language models, specifically in GPT2.
Our model, GPT2E, extends the Transformer layers architecture of GPT2 to Entity-Transformers, an architecture designed to handle coreference information when present.
arXiv Detail & Related papers (2020-11-10T22:28:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.