Related papers: Towards Pareto Optimal Throughput in Small Language Model Serving

Towards Pareto Optimal Throughput in Small Language Model Serving

URL: http://arxiv.org/abs/2404.03353v1
Date: Thu, 4 Apr 2024 10:45:07 GMT
Title: Towards Pareto Optimal Throughput in Small Language Model Serving
Authors: Pol G. Recasens, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral,
Abstract summary: Small Language Models (SLMs) offer new opportunities for resource-constrained users. We present a set of experiments designed to benchmark SLM inference at performance and energy levels.
Score: 4.497936996651617
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have revolutionized the state-of-the-art of many different natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who now are able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference at performance and energy levels. Our analysis provides a new perspective in serving, highlighting that the small memory footprint of SLMs allows for reaching the Pareto-optimal throughput within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization for serving SLMs.

Related papers

Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks [4.448709087838503]
Small Language Models (SLMs) offer lightweight and cost-effective alternatives to Large Language Models (LLMs)<n>This study presents a comprehensive empirical evaluation of 20 open-source SLMs ranging from 0.4B to 10B parameters on five code-related benchmarks.
arXiv Detail & Related papers (2025-07-03T20:32:36Z)
LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources.<n>We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z)
Efficient Multitask Learning in Small Language Models Through Upside-Down Reinforcement Learning [8.995427413172148]
Small language models (SLMs) can achieve competitive performance in multitask prompt generation tasks. We train an SLM that achieves relevance scores within 5% of state-of-the-art models, including Llama-3, Qwen2, and Mistral, despite being up to 80 times smaller.
arXiv Detail & Related papers (2025-02-14T01:39:45Z)
Small Language Models (SLMs) Can Still Pack a Punch: A survey [2.7309692684728617]
Survey of about 160 papers presents a family of Small Language Models (SLMs) in the 1 to 8 billion parameter range. We explore task agnostic, general purpose SLMs, task-specific SLMs and techniques to create SLMs that can guide the community to build models.
arXiv Detail & Related papers (2025-01-03T19:53:57Z)
A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and performance to perform various language tasks with minimal computational resources. We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
Stacking Small Language Models for Generalizability [0.0]
Large language models (LLMs) generalize strong performance across different natural language benchmarks. This paper introduces a new approach called fine-tuning stacks of language models (FSLM) By fine-tuning each SLM to perform a specific task, this approach breaks down high level reasoning into multiple lower-level steps that specific SLMs are responsible for. As a result, FSLM allows for lower training and inference costs, and also improves model interpretability as each SLM communicates with the subsequent one through natural language.
arXiv Detail & Related papers (2024-10-21T01:27:29Z)
Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT) We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training. Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z)
SLMRec: Empowering Small Language Models for Sequential Recommendation [38.51895517016953]
Sequential Recommendation task involves predicting the next item a user is likely to interact with, given their past interactions. Recent research demonstrates the great impact of LLMs on sequential recommendation systems. Due to the huge size of LLMs, it is inefficient and impractical to apply a LLM-based model in real-world platforms.
arXiv Detail & Related papers (2024-05-28T07:12:06Z)
LLM Augmented LLMs: Expanding Capabilities through Composition [56.40953749310957]
CALM -- Composition to Augment Language Models -- introduces cross-attention between models to compose their representations and enable new capabilities. We illustrate that augmenting PaLM2-S with a smaller model trained on low-resource languages results in an absolute improvement of up to 13% on tasks like translation into English. When PaLM2-S is augmented with a code-specific model, we see a relative improvement of 40% over the base model for code generation and explanation tasks.
arXiv Detail & Related papers (2024-01-04T18:53:01Z)
Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks. However, the massive size of these models poses huge challenges for their deployment in real-world applications. We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z)
Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.