The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators
- URL: http://arxiv.org/abs/2407.11004v1
- Date: Tue, 25 Jun 2024 17:58:26 GMT
- Title: The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators
- Authors: Tzu-Heng Huang, Catherine Cao, Vaishnavi Bhargava, Frederic Sala
- Abstract summary: Large pretrained models can be used as annotators, helping replace or augment crowdworkers.
This comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls.
We propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels.
- Score: 11.056579191156498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling distilling generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains comparable to or better performance than large language model-based annotation in a range of tasks for a fraction of the cost: on average, improvements amount to a 12.9% enhancement while the total labeling costs across all datasets are reduced by a factor of approximately 500x.
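As a rough illustration of the approach described in the abstract, the sketch below asks an LLM once for a labeling program and then applies that program locally over the unlabeled data. It assumes the OpenAI Python client; the prompt, model name, task, and example texts are illustrative assumptions, not the authors' actual setup.

```python
"""
Minimal sketch of program-synthesis labeling as described in the abstract:
instead of paying for one API call per example, ask the model ONCE to write a
labeling program, then run that program locally. Prompt, model name, and task
are illustrative, not the authors' exact configuration.
"""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASK_DESCRIPTION = (
    "Write a Python function `label(text: str) -> int` that returns 1 if a "
    "product review is positive and 0 if it is negative. Use simple keyword "
    "heuristics. Return only the code, no explanation."
)

# One API call to synthesize the labeling program (vs. one call per example).
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": TASK_DESCRIPTION}],
)
program_source = response.choices[0].message.content

# Best-effort cleanup of accidental markdown fences around the returned code.
program_source = program_source.strip().strip("`").removeprefix("python\n")

# The generated program can be stored, audited, edited, and re-used offline.
namespace = {}
exec(program_source, namespace)  # in practice, sandbox untrusted generated code
label = namespace["label"]

unlabeled = ["great battery life, would buy again", "stopped working after a week"]
labels = [label(text) for text in unlabeled]  # zero additional API cost
print(labels)
```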
Related papers
- Enabling Small Models for Zero-Shot Classification through Model Label Learning [50.68074833512999]
We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities.
Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL.
arXiv Detail & Related papers (2024-08-21T09:08:26Z)
- Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt+ (DS+), a general paradigm for collaboration between small and large models.
For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, while DS+ achieves 95.64% at only 31.18% of the cost.
arXiv Detail & Related papers (2024-06-15T14:44:43Z)
- Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance [5.009377915313077]
This work addresses the research gap of how many labelled samples are required for the specialised small models to outperform general large models.
We show that the specialised models often need only a few labelled samples (on average 10 - 1000) to match or outperform the general ones.
When performance variance is taken into account, the number of required labels increases on average by 100 - 200%, and up to 1500% in specific cases.
arXiv Detail & Related papers (2024-02-20T08:38:24Z)
- Making Large Language Models Better Data Creators [22.0882632635255]
Large language models (LLMs) have advanced the state-of-the-art in NLP significantly.
However, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security.
We propose a unified data creation pipeline that requires only a single format example.
arXiv Detail & Related papers (2023-10-31T01:08:34Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
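The distilling step-by-step entry above trains small models with LLM-extracted rationales as extra supervision. The sketch below shows the general shape of such a multi-task objective (label prediction plus rationale generation); the tensor shapes and weighting constant are illustrative, and the paper's actual implementation (a seq2seq model with task prefixes) differs in detail.

```python
"""
Rough sketch of a distilling-step-by-step style multi-task loss: the small
model is trained to predict the task label and to generate the LLM-provided
rationale, with the two losses mixed by a weighting factor.
"""
import torch
import torch.nn.functional as F

def distill_step_by_step_loss(label_logits, gold_labels,
                              rationale_logits, rationale_tokens,
                              rationale_weight=0.5):
    # Standard supervised loss on the task label.
    label_loss = F.cross_entropy(label_logits, gold_labels)
    # Token-level loss on generating the LLM-extracted rationale.
    rationale_loss = F.cross_entropy(
        rationale_logits.flatten(0, 1), rationale_tokens.flatten()
    )
    return label_loss + rationale_weight * rationale_loss

# Toy shapes: batch of 4, 3 classes, rationales of 16 tokens over a 100-token vocab.
label_logits = torch.randn(4, 3)
gold_labels = torch.randint(0, 3, (4,))
rationale_logits = torch.randn(4, 16, 100)
rationale_tokens = torch.randint(0, 100, (4, 16))
print(distill_step_by_step_loss(label_logits, gold_labels,
                                rationale_logits, rationale_tokens))
```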
- Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models [19.464992602919015]
Fine-tuning large models is highly effective; however, inference can be expensive and produces carbon emissions.
We show that distilling from T5-XXL (11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to annotating more data.
We will make our code, datasets, annotation cost estimates, and baseline models available as a benchmark to support further work on cost-efficient training of compact models.
arXiv Detail & Related papers (2023-05-02T17:56:16Z)
- Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data.
The original FLOP count can be reduced by up to 40% with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z)
- Optimizing Data Collection for Machine Learning [87.37252958806856]
Modern deep learning systems require huge data sets to achieve impressive performance.
Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay.
We propose a new paradigm for modeling the data collection as a formal optimal data collection problem.
arXiv Detail & Related papers (2022-10-03T21:19:05Z)
- Eliciting and Learning with Soft Labels from Every Annotator [31.10635260890126]
We focus on efficiently eliciting soft labels from individual annotators.
We demonstrate that learning with our labels achieves comparable model performance to prior approaches.
arXiv Detail & Related papers (2022-07-02T12:03:00Z)
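The soft-label entry above concerns eliciting per-example label distributions from annotators. As a generic illustration of the training side only (not the paper's elicitation protocol), the snippet below replaces one-hot cross-entropy with a loss over the full soft target distribution.

```python
"""
Generic sketch of learning from soft labels: the loss uses the full per-example
label distribution (e.g., elicited from annotators) instead of a one-hot target.
The elicitation step itself is not reproduced here.
"""
import torch
import torch.nn.functional as F

def soft_label_loss(logits, soft_targets):
    # Cross-entropy against a probability distribution rather than a class index.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

logits = torch.randn(4, 3)                     # model outputs for 4 examples, 3 classes
soft_targets = torch.tensor([[0.7, 0.2, 0.1],  # annotator-provided label distributions
                             [0.1, 0.8, 0.1],
                             [0.3, 0.3, 0.4],
                             [0.0, 0.1, 0.9]])
print(soft_label_loss(logits, soft_targets))
```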
- Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction [18.46293613612346]
We introduce zero-shot cost models which enable learned cost estimation that generalizes to unseen databases.
We suggest a new learning paradigm based on pre-trained cost models.
We show that zero-shot cost models can be used in a few-shot mode that further improves their quality by retraining them just with a small number of additional training queries on the unseen database.
arXiv Detail & Related papers (2022-01-03T10:11:35Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
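The self-training entry above incorporates uncertainty estimates from the underlying network when selecting pseudo-labels. One common way to do this, sketched below with illustrative thresholds and a toy model, is to use MC dropout and keep only stable, confident predictions; this is a generic illustration rather than the paper's exact estimator.

```python
"""
Sketch of uncertainty-aware pseudo-label selection for self-training: run the
model several times with dropout active (MC dropout) and keep only unlabeled
examples whose predictions are stable across samples. Thresholds, model, and
data are illustrative.
"""
import torch
import torch.nn as nn

def select_pseudo_labels(model, unlabeled_x, n_samples=10, var_threshold=0.02):
    model.train()  # keep dropout active for MC sampling
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(unlabeled_x), dim=-1)
                             for _ in range(n_samples)])   # (n_samples, N, C)
    mean_probs = probs.mean(dim=0)
    confidence, pseudo_labels = mean_probs.max(dim=-1)
    # Variance of the winning class across dropout samples measures instability.
    variance = probs.var(dim=0).gather(1, pseudo_labels.unsqueeze(1)).squeeze(1)
    keep = variance < var_threshold
    return unlabeled_x[keep], pseudo_labels[keep], confidence[keep]

# Toy model and data just to make the sketch runnable.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(0.3), nn.Linear(32, 3))
x_unlabeled = torch.randn(100, 8)
x_sel, y_sel, conf = select_pseudo_labels(model, x_unlabeled)
print(f"kept {len(x_sel)} of {len(x_unlabeled)} pseudo-labeled examples")
```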
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.