Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance
- URL: http://arxiv.org/abs/2402.12819v2
- Date: Fri, 26 Apr 2024 08:20:40 GMT
- Title: Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance
- Authors: Branislav Pecher, Ivan Srba, Maria Bielikova
- Abstract summary: This work addresses the research gap of how many labelled samples are required for the specialised small models to outperform general large models.
We show that the specialised models often need only a few samples (on average $10 - 1000$) to be on par with or better than the general ones.
When performance variance is taken into consideration, the number of required labels increases on average by $100 - 200\%$ and even up to $1500\%$ in specific cases.
- Score: 5.009377915313077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When solving NLP tasks with limited labelled data, researchers can either use a general large language model without further update, or use a small number of labelled examples to tune a specialised smaller model. In this work, we address the research gap of how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration. By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 7 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only a few samples (on average $10 - 1000$) to be on par or better than the general ones. At the same time, the number of required labels strongly depends on the dataset or task characteristics, with this number being significantly lower on multi-class datasets (up to $100$) than on binary datasets (up to $5000$). When performance variance is taken into consideration, the number of required labels increases on average by $100 - 200\%$ and even up to $1500\%$ in specific cases.
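To make the break-even setup above concrete, here is a minimal sketch, assuming a TF-IDF plus logistic-regression classifier as a stand-in for the specialised small model and a fixed, externally measured zero-shot accuracy for the general LLM; the dataset, the 0.70 accuracy value, and the labelling budgets are illustrative, not taken from the paper.

```python
# Sketch: find the labelling budget at which a specialised small model catches
# up with a general LLM on a text classification task.
# Assumptions (not from the paper): TF-IDF + logistic regression stands in for
# the specialised model; the LLM's zero-shot accuracy is a fixed number.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

LLM_ZERO_SHOT_ACCURACY = 0.70  # hypothetical score of the general large model

train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
test = fetch_20newsgroups(subset="test", categories=["sci.med", "sci.space"])

vectorizer = TfidfVectorizer(max_features=20_000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

break_even = None
for n_labels in (10, 50, 100, 500, 1000):        # labelling budgets to try
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[:n_labels], train.target[:n_labels])
    acc = accuracy_score(test.target, clf.predict(X_test))
    print(f"{n_labels:>5} labels -> accuracy {acc:.3f}")
    if break_even is None and acc >= LLM_ZERO_SHOT_ACCURACY:
        break_even = n_labels

print("break-even budget:", break_even if break_even is not None else "not reached")
```

The paper additionally accounts for performance variance across runs, which this sketch ignores; repeating each budget over several random seeds and comparing distributions rather than single runs would be closer to that setting.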
Related papers
- Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling [21.762562172089236]
Instead of relying on scarce domain-specific data alone, we build specialist models from large generalist training sets.
We adjust the training distribution of the generalist data with guidance from the limited domain-specific data.
It is scalable, suitable for pretraining and continued pretraining, and works well in multi-task settings.
arXiv Detail & Related papers (2024-09-30T20:49:54Z)
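A rough sketch of the clustered-importance-sampling idea summarised in the entry above, assuming precomputed document embeddings and k-means clusters; the function name importance_sample, the add-one smoothing, and the proportional weighting are illustrative choices, not the paper's exact recipe.

```python
# Sketch: reweight a large generalist corpus towards a small domain-specific
# set by clustering the generalist data and sampling clusters in proportion
# to how often domain documents fall into them.
# Assumptions (not from the paper): precomputed embeddings, k-means clusters,
# and simple proportional weighting with add-one smoothing.
import numpy as np
from sklearn.cluster import KMeans

def importance_sample(general_emb, domain_emb, n_clusters=100, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init="auto").fit(general_emb)
    general_assign = km.labels_                 # cluster id of each generalist doc
    domain_assign = km.predict(domain_emb)      # nearest cluster for each domain doc

    # Cluster weights follow the domain distribution; add-one smoothing keeps
    # every cluster reachable, so the sampled data is not purely domain-like.
    counts = np.bincount(domain_assign, minlength=n_clusters) + 1
    cluster_weight = counts / counts.sum()

    # Convert cluster weights into per-document sampling probabilities.
    doc_prob = cluster_weight[general_assign]
    doc_prob /= doc_prob.sum()
    return rng.choice(len(general_emb), size=n_samples, replace=True, p=doc_prob)
```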
- Target-Aware Language Modeling via Granular Data Sampling [25.957424920194914]
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources.
A cost-effective and straightforward approach is sampling with low-dimensional data features.
We show that pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
arXiv Detail & Related papers (2024-09-23T04:52:17Z)
- The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators [11.056579191156498]
Large pretrained models can be used as annotators, helping replace or augment crowdworkers.
This comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls.
We propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels.
arXiv Detail & Related papers (2024-06-25T17:58:26Z)
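A hedged sketch of the labelling-program idea from the ALCHEmist entry above: instead of one paid API call per unlabelled example, the model is asked once for a small labelling function that then runs locally. The prompt, the ask_llm placeholder, and the label(text) contract are hypothetical, not the paper's system.

```python
# Sketch: ask an LLM once for a Python labelling function, then apply it to the
# whole corpus locally, avoiding per-example API calls.
# Assumptions (not from the paper): ask_llm is a placeholder for any chat API,
# and the returned code defines a function named label(text) -> str.

PROMPT = """Write a Python function label(text) that classifies a product review
as 'positive' or 'negative' using simple keyword rules. Return only code."""

def ask_llm(prompt: str) -> str:
    # Placeholder: substitute a real API call (OpenAI, Anthropic, a local model, ...).
    raise NotImplementedError

def build_labeler(prompt: str = PROMPT):
    code = ask_llm(prompt)
    namespace: dict = {}
    exec(code, namespace)          # in practice, sandbox and review generated code
    return namespace["label"]

def label_corpus(texts):
    label = build_labeler()
    return [label(t) for t in texts]   # no additional API calls per example
```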
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks [51.14185612418977]
A strategy to improve label quality is to ask multiple annotators to label the same item and aggregate their labels.
While a variety of bespoke models have been proposed for specific tasks, our work is the first to introduce aggregation methods that generalize across many diverse complex tasks.
This article extends our prior work with investigation of three new research questions.
arXiv Detail & Related papers (2023-12-20T21:28:35Z)
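As a baseline illustration of the "label the same item several times and aggregate" strategy mentioned in the entry above, here is a minimal majority-vote aggregator over categorical labels; the paper's contribution is a far more general model, so this is only the simplest instance, with hypothetical data.

```python
# Sketch: aggregate multiple annotators' labels per item by majority vote,
# the simplest instance of multi-annotator label aggregation.
# Ties are broken by whichever label was seen first.
from collections import Counter

def majority_vote(annotations: dict[str, list[str]]) -> dict[str, str]:
    """annotations maps item id -> list of labels from different annotators."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in annotations.items()}

# Hypothetical example: three annotators label two items.
votes = {"doc1": ["spam", "spam", "ham"], "doc2": ["ham", "spam", "ham"]}
print(majority_vote(votes))   # {'doc1': 'spam', 'doc2': 'ham'}
```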
- GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks [3.9638110494107095]
In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts.
We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning.
We show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers.
arXiv Detail & Related papers (2023-11-16T06:28:05Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Several anchor points suffice to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
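A simplified sketch of the anchor-point idea from the entry above, assuming an (examples x models) matrix of confidences in the correct class; selecting the examples nearest to k-means centroids is an illustrative stand-in for the paper's actual selection procedure.

```python
# Sketch: pick a handful of "anchor" examples whose confidence profiles across
# models are representative of the whole benchmark, then evaluate new models
# on the anchors only.
# Assumptions (not from the paper): confidences is an (n_examples, n_models)
# array of each model's confidence in the correct class; anchors are the
# examples closest to k-means centroids over these profiles.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def select_anchor_points(confidences: np.ndarray, n_anchors: int = 10, seed: int = 0):
    km = KMeans(n_clusters=n_anchors, random_state=seed, n_init="auto").fit(confidences)
    # Index of the example nearest to each cluster centre = one anchor per cluster.
    anchors = pairwise_distances_argmin(km.cluster_centers_, confidences)
    return anchors, km.labels_

def estimate_mean(new_model_conf_on_anchors, labels):
    # new_model_conf_on_anchors[i] = the new model's confidence on the anchor
    # for cluster i; weight each anchor by the share of examples in its cluster.
    weights = np.bincount(labels, minlength=len(new_model_conf_on_anchors)) / labels.size
    return float(np.dot(weights, new_model_conf_on_anchors))
```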
- USB: A Unified Summarization Benchmark Across Tasks and Domains [68.82726887802856]
We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks.
We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models.
arXiv Detail & Related papers (2023-05-23T17:39:54Z)
- Few-shot learning approaches for classifying low resource domain specific software requirements [1.1470070927586016]
Few-shot learning is a type of deep learning that uses only a few annotated samples.
Our experiments focus on classifying BOSCH automotive domain textual software requirements into 3 categories.
While SciBERT- and DeBERTa-based models tend to be the most accurate with 15 training samples, their performance improves only marginally as the number of annotated samples is increased to 50, in comparison to Siamese- and T5-based models.
arXiv Detail & Related papers (2023-02-14T10:19:23Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
- Low Resource Multi-Task Sequence Tagging -- Revisiting Dynamic Conditional Random Fields [67.51177964010967]
We compare different models for low resource multi-task sequence tagging that leverage dependencies between label sequences for different tasks.
We find that explicit modeling of inter-dependencies between task predictions outperforms single-task as well as standard multi-task models.
arXiv Detail & Related papers (2020-05-01T07:11:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it lists and is not responsible for any consequences of its use.