Task Contamination: Language Models May Not Be Few-Shot Anymore
- URL: http://arxiv.org/abs/2312.16337v1
- Date: Tue, 26 Dec 2023 21:17:46 GMT
- Title: Task Contamination: Language Models May Not Be Few-Shot Anymore
- Authors: Changmao Li and Jeffrey Flanigan
- Abstract summary: Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks.
However, their success in zero-shot and few-shot settings may be affected by task contamination.
This paper investigates how the zero-shot and few-shot performance of LLMs has changed over time.
- Score: 9.696290050028237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how the zero-shot and few-shot performance of LLMs has changed over time. Utilizing GPT-3 series models and several other recent open-source LLMs, and controlling for dataset difficulty, we find that LLMs perform markedly better on datasets released before their training data creation date than on datasets released after it. This strongly indicates that, for many LLMs, task contamination exists in zero-shot and few-shot evaluation on datasets released prior to the LLMs' training data creation date. Additionally, we use training data inspection, task example extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero-shot and few-shot settings.
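The chronological analysis reduces to a simple comparison: group evaluation datasets by whether they were released before or after the model's training data creation date, then compare each group's margin over a simple majority-class baseline. A minimal sketch in Python, with made-up accuracy numbers and a hypothetical cutoff year standing in for real evaluation results:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class EvalResult:
    name: str
    release_year: int
    model_accuracy: float      # zero- or few-shot accuracy of the LLM
    majority_baseline: float   # accuracy of always predicting the majority class

TRAINING_DATA_CUTOFF = 2021    # hypothetical training data creation date

# Illustrative numbers only; real values would come from an evaluation harness.
results = [
    EvalResult("dataset_a", 2018, 0.81, 0.52),
    EvalResult("dataset_b", 2019, 0.77, 0.55),
    EvalResult("dataset_c", 2022, 0.58, 0.56),
    EvalResult("dataset_d", 2023, 0.54, 0.51),
]

def margin(group):
    # Average improvement over the majority baseline, the key comparison
    # for the classification tasks discussed in the abstract.
    return mean(r.model_accuracy - r.majority_baseline for r in group)

pre = [r for r in results if r.release_year < TRAINING_DATA_CUTOFF]
post = [r for r in results if r.release_year >= TRAINING_DATA_CUTOFF]
print(f"pre-cutoff margin over majority baseline:  {margin(pre):+.3f}")
print(f"post-cutoff margin over majority baseline: {margin(post):+.3f}")
```

A large pre-cutoff margin alongside a near-zero post-cutoff margin is the contamination signature the abstract describes; training data inspection, task example extraction, and membership inference then serve as complementary checks.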
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM, then use these pairs to finetune the student LLM itself.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
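As a rough illustration of that mechanism, here is a minimal sketch of a self-synthetic finetuning loop; `student.generate` and `student.finetune` are hypothetical interfaces, and the staging is a plausible reading of the abstract rather than the published algorithm:

```python
def self_guide(student, task_instruction: str, n_pairs: int = 64, rounds: int = 2):
    """Sketch of a SELF-GUIDE-style loop: the student LLM synthesizes its own
    task-specific (input, output) pairs and is finetuned on the result."""
    for _ in range(rounds):
        # Stage 1: the student synthesizes candidate inputs from the instruction alone.
        inputs = [student.generate(f"{task_instruction}\nWrite one new example input:")
                  for _ in range(n_pairs)]
        # Stage 2: the same model labels its own synthetic inputs.
        pairs = [(x, student.generate(f"{task_instruction}\nInput: {x}\nOutput:"))
                 for x in inputs]
        # Stage 3: simple quality filtering (non-empty outputs, deduplicated inputs).
        seen, kept = set(), []
        for x, y in pairs:
            if y and x not in seen:
                seen.add(x)
                kept.append((x, y))
        # Stage 4: finetune the student on its own filtered synthetic pairs.
        student.finetune(kept)
    return student
```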
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to orchestrate tasks and process inputs from users and other sources.
This opens the door to prompt injection attacks, where the LLM receives and acts upon instructions from supposedly data-only sources, thus deviating from the user's original instructions.
We define this as task drift, and we propose to catch it by scanning and analyzing the LLM's activations.
We show that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks.
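As a toy illustration of the activation-scanning idea, the sketch below fits a linear probe on synthetic "activation deltas"; in a real setting each delta would come from model hooks comparing the hidden state before and after external text is processed, whereas here random vectors merely stand in:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 64  # stand-in for the model's hidden-state dimension

# Synthetic deltas: benign inputs cluster near zero, while inputs whose
# embedded instructions take over shift the activations systematically.
X_clean = rng.normal(loc=0.0, size=(200, dim))
X_drift = rng.normal(loc=0.8, size=(200, dim))
X = np.vstack([X_clean, X_drift])
y = np.array([0] * 200 + [1] * 200)  # 1 = task drift present

# A simple linear probe separates the two regimes in this toy setup.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe training accuracy:", probe.score(X, y))
```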
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
- MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents [62.02920842630234]
We show how to build small models that have GPT-4-level performance at 400x lower cost.
We unify pre-existing datasets into a benchmark, LLM-AggreFact.
Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy.
arXiv Detail & Related papers (2024-04-16T17:59:10Z)
- Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models [21.10890310571397]
We introduce a variety of techniques to assess whether a language model has seen a dataset during training.
We compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training.
We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting.
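One concrete form such a technique can take is a row-completion test: show the model the verbatim first rows of a canonical dataset file and check whether it reproduces the next row exactly. A minimal sketch, where `complete` is a hypothetical wrapper around an LLM completion endpoint:

```python
def row_completion_test(complete, rows: list[str], n_trials: int = 25) -> float:
    """Return the fraction of rows the model continues verbatim."""
    assert len(rows) > 3, "need at least a few rows to test"
    trials = min(n_trials, len(rows) - 3)
    hits = 0
    for i in range(2, 2 + trials):
        # Show the file verbatim up to row i and ask for the continuation.
        prompt = "\n".join(rows[:i]) + "\n"
        guess = (complete(prompt).splitlines() or [""])[0].strip()
        hits += guess == rows[i].strip()
    return hits / trials
```

A completion rate far above chance across many rows would suggest the file appeared in the training data.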
arXiv Detail & Related papers (2024-04-09T10:58:21Z)
- How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
With the rise of Large Language Models (LLMs) in recent years, new opportunities are emerging, but also new challenges, and contamination is quickly becoming critical.
Business applications and fundraising in AI have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into tens of millions of dollars.
It is becoming harder and harder, if not impossible, to keep track of the data that LLMs have seen, since closed-source models like GPT-4 and Claude-3 divulge no information about their training sets.
arXiv Detail & Related papers (2024-03-31T14:32:02Z)
- LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements of up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC, and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
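A minimal sketch of an iterative, error-targeted augmentation loop in that spirit; `train`, `predict`, and `teacher_generate` are hypothetical stand-ins for a finetuning routine, student inference, and a teacher-LLM call:

```python
def llm2llm(seed_data, train, predict, teacher_generate, rounds: int = 3):
    """Sketch of an LLM2LLM-style loop: augment only where the student fails."""
    data = list(seed_data)
    for _ in range(rounds):
        student = train(data)
        # Collect the examples the current student still gets wrong.
        errors = [(x, y) for x, y in data if predict(student, x) != y]
        # Ask the teacher for one new example targeted at each failure.
        data.extend(teacher_generate(x, y) for x, y in errors)
    return train(data)
```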
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
- Elephants Never Forget: Testing Language Models for Memorization of Tabular Data [21.912611415307644]
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over.
We introduce a variety of techniques to assess the degree of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization.
arXiv Detail & Related papers (2024-03-11T12:07:13Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.