Scaling Embedding Layers in Language Models
- URL: http://arxiv.org/abs/2502.01637v1
- Date: Mon, 03 Feb 2025 18:59:32 GMT
- Title: Scaling Embedding Layers in Language Models
- Authors: Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang
- Abstract summary: SCONE enables two new scaling strategies: increasing the number of cached $n$-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS.
We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
- Score: 52.47659840377581
- License:
- Abstract: We propose SCONE ($\textbf{S}$calable, $\textbf{C}$ontextualized, $\textbf{O}$ffloaded, $\textbf{N}$-gram $\textbf{E}$mbedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent $n$-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached $n$-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
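The abstract describes the mechanism only at a high level. Below is a minimal sketch, assuming a precomputed table of embeddings for frequent $n$-grams kept in off-accelerator (CPU) memory and added to the ordinary token embeddings at inference; the class and variable names are illustrative assumptions, and the separate model that learns the $n$-gram embeddings during training is not shown.

```python
# Minimal sketch (not the paper's code): augment token embeddings with cached
# n-gram embeddings looked up from off-accelerator (CPU) memory at inference.
import torch
import torch.nn as nn

class NgramAugmentedEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, ngram_table, ngram_to_row, n=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # original vocabulary is retained
        self.ngram_table = ngram_table    # [num_ngrams, d_model] precomputed, resident on CPU
        self.ngram_to_row = ngram_to_row  # dict: tuple of token ids -> row in ngram_table
        self.n = n

    @torch.no_grad()
    def forward(self, token_ids):                          # token_ids: [batch, seq]
        x = self.tok_emb(token_ids)                        # [batch, seq, d_model]
        extra = torch.zeros_like(x)
        for b, seq in enumerate(token_ids.tolist()):
            for t in range(len(seq)):
                ngram = tuple(seq[max(0, t - self.n + 1): t + 1])  # n-gram ending at position t
                row = self.ngram_to_row.get(ngram)
                if row is not None:                        # cache hit for a frequent n-gram
                    extra[b, t] = self.ngram_table[row].to(x.device)
        return x + extra                                   # contextualized input representation
```

Because the table is only consulted at inference, growing the number of cached $n$-grams, or the size of the model that produced them, leaves accelerator-side FLOPS unchanged, which is the scaling property the abstract emphasizes.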
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose $\textbf{LESA}$, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- $\text{M}^{\text{3}}$: A Modular World Model over Streams of Tokens [51.65485693709418]
Token-based world models emerged as a promising modular framework, modeling dynamics over token streams while optimizing tokenization separately.
In this paper, we introduce $\text{M}^{\text{3}}$, a $\textbf{m}$odular $\textbf{w}$orld $\textbf{m}$odel that extends this framework.
$\text{M}^{\text{3}}$ incorporates several improvements from the existing literature to enhance agent performance.
arXiv Detail & Related papers (2025-02-17T08:06:10Z)
- ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming [14.937905258757635]
$\textbf{ST}^3$ is a framework designed to accelerate MLLM inference without retraining.
$\textbf{ST}^3$ can be seamlessly integrated into existing pre-trained MLLMs.
arXiv Detail & Related papers (2024-12-28T10:17:29Z)
- Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformer model size for $\textit{pretraining}$ large language models.
We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method.
We found a simpler selection method, $\textbf{\texttt{Avg-K}}$, that selects blocks through their mean aggregated hidden states, achieving lower perplexity in language model pretraining (see the sketch after this entry).
arXiv Detail & Related papers (2023-05-23T12:28:37Z)
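The summary above states the Avg-K rule only in passing. The following is a minimal sketch of one plausible reading, assuming each memory block is scored by the dot product between the input hidden state and the mean of that block's key vectors, and only the top-scoring blocks participate in the feed-forward pass; the block layout and names (avg_k_ffn, keys, values) are illustrative, not the authors' implementation.

```python
# Hedged sketch of an Avg-K-style block selection for a sparse FFN
# (one reading of "selects blocks through their mean aggregated hidden states").
import torch

def avg_k_ffn(x, keys, values, num_blocks, top_k):
    """x: [d_model]; keys: [d_ff, d_model]; values: [d_ff, d_model] (d_ff divisible by num_blocks)."""
    d_ff = keys.shape[0]
    block = d_ff // num_blocks
    # Score each memory block by <x, mean of the block's key vectors>.
    block_keys = keys.view(num_blocks, block, -1).mean(dim=1)   # [num_blocks, d_model]
    scores = block_keys @ x                                     # [num_blocks]
    chosen = torch.topk(scores, top_k).indices                  # indices of selected blocks
    # Run the FFN over the selected blocks only.
    out = torch.zeros_like(x)
    for b in chosen.tolist():
        k = keys[b * block:(b + 1) * block]                     # [block, d_model]
        v = values[b * block:(b + 1) * block]                   # [block, d_model]
        h = torch.relu(k @ x)                                    # [block]
        out = out + v.t() @ h                                    # [d_model]
    return out
```

For example, with d_ff split into 4 blocks and top_k=2, only half of the FFN parameters are touched per token.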
- Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for $\underline{\textbf{S}}$tate s$\underline{\textbf{P}}$ace $\underline{\textbf{A}}$ugmente$\underline{\textbf{D}}$ Transform$\underline{\textbf{E}}$r.
We augment an SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers.
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
- Regularized Training of Nearest Neighbor Language Models [10.994336081018043]
We build upon $k$NN-LM (Khandelwal et al., 2020), which uses a pre-trained language model together with an exhaustive $k$NN search through the training data (memory bank) to achieve state-of-the-art results.
We find that the added L2 regularization seems to improve the performance for high-frequency words without deteriorating the performance for low-frequency ones (see the sketch after this entry).
arXiv Detail & Related papers (2021-09-16T23:20:24Z)
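For context on the $k$NN-LM mechanism referenced above (Khandelwal et al., 2020): the next-token distribution interpolates the base LM distribution with a distribution built from the $k$ nearest neighbors retrieved from the memory bank. The sketch below assumes a generic datastore of (context vector, next token) pairs and an interpolation weight lmbda; all names are illustrative.

```python
# Sketch of kNN-LM interpolation (Khandelwal et al., 2020); datastore layout
# and names are assumptions for illustration, not this paper's code.
import torch

def knn_lm_probs(p_lm, query, datastore_keys, datastore_vals, vocab_size, k=8, lmbda=0.25):
    """p_lm: [vocab] base LM probs; query: [d] context vector;
    datastore_keys: [N, d]; datastore_vals: [N] long tensor of next-token ids."""
    dists = torch.cdist(query[None, :], datastore_keys)[0]   # [N] L2 distances to all entries
    nn = torch.topk(-dists, k).indices                        # k nearest neighbors
    weights = torch.softmax(-dists[nn], dim=0)                # closer neighbors weigh more
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, datastore_vals[nn], weights)        # aggregate weight per retrieved token
    return lmbda * p_knn + (1.0 - lmbda) * p_lm               # interpolated next-token distribution
```

The L2 regularization studied in the paper applies during training and is not shown here.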
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)