Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance
- URL: http://arxiv.org/abs/2402.12819v2
- Date: Fri, 26 Apr 2024 08:20:40 GMT
- Title: Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance
- Authors: Branislav Pecher, Ivan Srba, Maria Bielikova
- Abstract summary: This work addresses the research gap of how many labelled samples are required for the specialised small models to outperform general large models.
We show that the specialised models often need only a few samples (on average $10 - 1000$) to be on par with or better than the general ones.
When performance variance is taken into consideration, the number of required labels increases on average by $100 - 200\%$ and even up to $1500\%$ in specific cases.
- Score: 5.009377915313077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When solving NLP tasks with limited labelled data, researchers can either use a general large language model without further update, or use a small number of labelled examples to tune a specialised smaller model. In this work, we address the research gap of how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration. By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 7 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only a few samples (on average $10 - 1000$) to be on par or better than the general ones. At the same time, the number of required labels strongly depends on the dataset or task characteristics, with this number being significantly lower on multi-class datasets (up to $100$) than on binary datasets (up to $5000$). When performance variance is taken into consideration, the number of required labels increases on average by $100 - 200\%$ and even up to $1500\%$ in specific cases.
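To make the break-even setup above concrete, here is a minimal sketch, assuming a TF-IDF plus logistic-regression classifier as a stand-in for the specialised small model and a fixed, externally measured zero-shot accuracy for the general LLM; the dataset, the 0.70 accuracy value, and the labelling budgets are illustrative, not taken from the paper.

```python
# Sketch: find the labelling budget at which a specialised small model catches
# up with a general LLM on a text classification task.
# Assumptions (not from the paper): TF-IDF + logistic regression stands in for
# the specialised model; the LLM's zero-shot accuracy is a fixed number.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

LLM_ZERO_SHOT_ACCURACY = 0.70  # hypothetical score of the general large model

train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
test = fetch_20newsgroups(subset="test", categories=["sci.med", "sci.space"])

vectorizer = TfidfVectorizer(max_features=20_000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

break_even = None
for n_labels in (10, 50, 100, 500, 1000):        # labelling budgets to try
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[:n_labels], train.target[:n_labels])
    acc = accuracy_score(test.target, clf.predict(X_test))
    print(f"{n_labels:>5} labels -> accuracy {acc:.3f}")
    if break_even is None and acc >= LLM_ZERO_SHOT_ACCURACY:
        break_even = n_labels

print("break-even budget:", break_even if break_even is not None else "not reached")
```

The paper additionally accounts for performance variance across runs, which this sketch ignores; repeating each budget over several random seeds and comparing distributions rather than single runs would be closer to that setting.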
Related papers
- Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling [21.762562172089236]
Instead of relying on scarce domain-specific data alone, we build specialist models from large generalist training sets.
We adjust the training distribution of the generalist data with guidance from the limited domain-specific data.
It is scalable, suitable for pretraining and continued pretraining, and works well in multi-task settings.
arXiv Detail & Related papers (2024-09-30T20:49:54Z)
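A rough sketch of the clustered-importance-sampling idea summarised in the entry above, assuming precomputed document embeddings and k-means clusters; the function name importance_sample, the add-one smoothing, and the proportional weighting are illustrative choices, not the paper's exact recipe.

```python
# Sketch: reweight a large generalist corpus towards a small domain-specific
# set by clustering the generalist data and sampling clusters in proportion
# to how often domain documents fall into them.
# Assumptions (not from the paper): precomputed embeddings, k-means clusters,
# and simple proportional weighting with add-one smoothing.
import numpy as np
from sklearn.cluster import KMeans

def importance_sample(general_emb, domain_emb, n_clusters=100, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init="auto").fit(general_emb)
    general_assign = km.labels_                 # cluster id of each generalist doc
    domain_assign = km.predict(domain_emb)      # nearest cluster for each domain doc

    # Cluster weights follow the domain distribution; add-one smoothing keeps
    # every cluster reachable, so the sampled data is not purely domain-like.
    counts = np.bincount(domain_assign, minlength=n_clusters) + 1
    cluster_weight = counts / counts.sum()

    # Convert cluster weights into per-document sampling probabilities.
    doc_prob = cluster_weight[general_assign]
    doc_prob /= doc_prob.sum()
    return rng.choice(len(general_emb), size=n_samples, replace=True, p=doc_prob)
```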
- Target-Aware Language Modeling via Granular Data Sampling [25.957424920194914]
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources.
A cost-effective and straightforward approach is sampling with low-dimensional data features.
We show that pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
arXiv Detail & Related papers (2024-09-23T04:52:17Z)
- The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators [11.056579191156498]
Large pretrained models can be used as annotators, helping replace or augment crowdworkers.
This comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls.
We propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels.
arXiv Detail & Related papers (2024-06-25T17:58:26Z)
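A hedged sketch of the labelling-program idea from the ALCHEmist entry above: instead of one paid API call per unlabelled example, the model is asked once for a small labelling function that then runs locally. The prompt, the ask_llm placeholder, and the label(text) contract are hypothetical, not the paper's system.

```python
# Sketch: ask an LLM once for a Python labelling function, then apply it to the
# whole corpus locally, avoiding per-example API calls.
# Assumptions (not from the paper): ask_llm is a placeholder for any chat API,
# and the returned code defines a function named label(text) -> str.

PROMPT = """Write a Python function label(text) that classifies a product review
as 'positive' or 'negative' using simple keyword rules. Return only code."""

def ask_llm(prompt: str) -> str:
    # Placeholder: substitute a real API call (OpenAI, Anthropic, a local model, ...).
    raise NotImplementedError

def build_labeler(prompt: str = PROMPT):
    code = ask_llm(prompt)
    namespace: dict = {}
    exec(code, namespace)          # in practice, sandbox and review generated code
    return namespace["label"]

def label_corpus(texts):
    label = build_labeler()
    return [label(t) for t in texts]   # no additional API calls per example
```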
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks [51.14185612418977]
A strategy to improve label quality is to ask multiple annotators to label the same item and aggregate their labels.
While a variety of bespoke models have been proposed for specific tasks, our work is the first to introduce aggregation methods that generalize across many diverse complex tasks.
This article extends our prior work with investigation of three new research questions.
arXiv Detail & Related papers (2023-12-20T21:28:35Z)
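As a baseline illustration of the "label the same item several times and aggregate" strategy mentioned in the entry above, here is a minimal majority-vote aggregator over categorical labels; the paper's contribution is a far more general model, so this is only the simplest instance, with hypothetical data.

```python
# Sketch: aggregate multiple annotators' labels per item by majority vote,
# the simplest instance of multi-annotator label aggregation.
# Ties are broken by whichever label was seen first.
from collections import Counter

def majority_vote(annotations: dict[str, list[str]]) -> dict[str, str]:
    """annotations maps item id -> list of labels from different annotators."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in annotations.items()}

# Hypothetical example: three annotators label two items.
votes = {"doc1": ["spam", "spam", "ham"], "doc2": ["ham", "spam", "ham"]}
print(majority_vote(votes))   # {'doc1': 'spam', 'doc2': 'ham'}
```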
- GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks [3.9638110494107095]
In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts.
We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning.
We show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers.
arXiv Detail & Related papers (2023-11-16T06:28:05Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Several anchor points suffice to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
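A simplified sketch of the anchor-point idea from the entry above, assuming an (examples x models) matrix of confidences in the correct class; selecting the examples nearest to k-means centroids is an illustrative stand-in for the paper's actual selection procedure.

```python
# Sketch: pick a handful of "anchor" examples whose confidence profiles across
# models are representative of the whole benchmark, then evaluate new models
# on the anchors only.
# Assumptions (not from the paper): confidences is an (n_examples, n_models)
# array of each model's confidence in the correct class; anchors are the
# examples closest to k-means centroids over these profiles.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def select_anchor_points(confidences: np.ndarray, n_anchors: int = 10, seed: int = 0):
    km = KMeans(n_clusters=n_anchors, random_state=seed, n_init="auto").fit(confidences)
    # Index of the example nearest to each cluster centre = one anchor per cluster.
    anchors = pairwise_distances_argmin(km.cluster_centers_, confidences)
    return anchors, km.labels_

def estimate_mean(new_model_conf_on_anchors, labels):
    # new_model_conf_on_anchors[i] = the new model's confidence on the anchor
    # for cluster i; weight each anchor by the share of examples in its cluster.
    weights = np.bincount(labels, minlength=len(new_model_conf_on_anchors)) / labels.size
    return float(np.dot(weights, new_model_conf_on_anchors))
```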
- USB: A Unified Summarization Benchmark Across Tasks and Domains [68.82726887802856]
We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks.
We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models.
arXiv Detail & Related papers (2023-05-23T17:39:54Z)
- Few-shot learning approaches for classifying low resource domain specific software requirements [1.1470070927586016]
Few-shot learning is a type of deep learning that uses only a few annotated samples.
Our experiments focus on classifying BOSCH automotive domain textual software requirements into 3 categories.
While SciBERT- and DeBERTa-based models tend to be the most accurate with 15 training samples, their performance improves only marginally as the number of annotated samples is increased to 50, in comparison to Siamese- and T5-based models.
arXiv Detail & Related papers (2023-02-14T10:19:23Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
- Low Resource Multi-Task Sequence Tagging -- Revisiting Dynamic Conditional Random Fields [67.51177964010967]
We compare different models for low resource multi-task sequence tagging that leverage dependencies between label sequences for different tasks.
We find that explicit modeling of inter-dependencies between task predictions outperforms single-task as well as standard multi-task models.
arXiv Detail & Related papers (2020-05-01T07:11:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it lists and is not responsible for any consequences of its use.