Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification
- URL: http://arxiv.org/abs/2410.00179v2
- Date: Wed, 2 Oct 2024 01:50:17 GMT
- Title: Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification
- Authors: Kush Dubey
- Abstract summary: Few-shot learning benchmarks are critical for evaluating modern NLP techniques.
It is possible, however, that benchmarks favor methods which easily make use of unlabeled text.
We run experiments to quantify the bias caused by pretraining on unlabeled test set text.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Few-shot learning benchmarks are critical for evaluating modern NLP techniques. It is possible, however, that benchmarks favor methods which easily make use of unlabeled text, because researchers can use unlabeled text from the test set to pretrain their models. Given the dearth of research on this potential problem, we run experiments to quantify the bias caused by pretraining on unlabeled test set text instead of on unlabeled, independently drawn text. Controlled few-shot and zero-shot experiments on 25 classification tasks and 3 language models -- BERT, GPT-2, and Mistral 7B -- do not find evidence of overoptimism. Furthermore, we demonstrate the importance of repeated subsampling when studying few-shot text classification, and recommend that few-shot learning benchmarks include multiple training folds. Code and data are available at https://github.com/kddubey/pretrain-on-test/.
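The recommendation to repeat the train-set subsampling can be made concrete with a short sketch. The following is a minimal, hypothetical Python example (not taken from the linked repository; function name and parameters are illustrative) that estimates few-shot accuracy by averaging over many randomly drawn training folds, with a bag-of-words logistic regression standing in for a pretrained language model:

```python
# Minimal sketch (not the paper's code): estimate few-shot accuracy by
# averaging over repeated subsamples of the labeled pool, as the abstract
# recommends. A bag-of-words + logistic regression classifier stands in
# for a pretrained language model.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def few_shot_accuracy(train_texts, train_labels, test_texts, test_labels,
                      shots_per_class=8, num_folds=50, seed=0):
    rng = np.random.default_rng(seed)
    train_labels = np.asarray(train_labels)
    scores = []
    for _ in range(num_folds):
        # Draw a fresh few-shot training fold: `shots_per_class` examples per
        # class (assumes each class has at least that many labeled examples).
        idx = np.concatenate([
            rng.choice(np.flatnonzero(train_labels == c),
                       size=shots_per_class, replace=False)
            for c in np.unique(train_labels)
        ])
        fold_texts = [train_texts[i] for i in idx]
        vectorizer = CountVectorizer().fit(fold_texts)
        clf = LogisticRegression(max_iter=1000).fit(
            vectorizer.transform(fold_texts), train_labels[idx])
        preds = clf.predict(vectorizer.transform(test_texts))
        scores.append(accuracy_score(test_labels, preds))
    return float(np.mean(scores)), float(np.std(scores))
```

Reporting the mean and spread across many folds, rather than the accuracy of a single split, is what the abstract means by repeated subsampling with multiple training folds.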
Related papers
- Test-Time Adaptation with Binary Feedback [50.20923012663613]
BiTTA is a novel dual-path optimization framework that balances binary feedback-guided adaptation on uncertain samples with agreement-based self-adaptation on confident predictions. Experiments show BiTTA achieves accuracy improvements of 13.3 percentage points over state-of-the-art baselines.
arXiv Detail & Related papers (2025-05-24T05:24:10Z)
- Identifying Fairness Issues in Automatically Generated Testing Content [9.698044538202442]
We review test content generated for a large-scale standardized English proficiency test with the goal of identifying content that only pertains to a certain subset of the test population.
We build a dataset of 601 generated texts annotated for fairness and explore a variety of methods for classification.
We find that combining prompt self-correction and few-shot learning performs best, yielding an F1 score of 0.79 on our held-out test set.
arXiv Detail & Related papers (2024-04-23T14:56:15Z)
- Detecting Pretraining Data from Large Language Models [90.12037980837738]
We study the pretraining data detection problem.
Given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text?
We introduce a new detection method, Min-K% Prob, based on a simple hypothesis: text that the model saw during pretraining is unlikely to contain tokens to which the model assigns very low probability.
arXiv Detail & Related papers (2023-10-25T17:21:23Z)
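A rough sketch of how such a score could be computed (under simplifying assumptions; not the authors' reference implementation): average the log-probabilities of the k% least-likely tokens in the text. GPT-2 and the 20% fraction below are stand-in choices.

```python
# Rough sketch of a Min-K% Prob style membership score. Higher scores
# suggest the text is more likely to have appeared in pretraining data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def min_k_prob(text: str, k: float = 0.2) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Average over the k% least-likely tokens.
    num_lowest = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, num_lowest, largest=False).values
    return lowest.mean().item()
```

A threshold on this score, calibrated on texts known to be inside or outside the pretraining corpus, then yields a binary membership prediction.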
- Rank-Aware Negative Training for Semi-Supervised Text Classification [3.105629960108712]
Semi-supervised text classification (SSTC) methods typically build on self-training.
This paper presents a Rank-aware Negative Training (RNT) framework that treats SSTC as learning with noisy labels.
arXiv Detail & Related papers (2023-06-13T08:41:36Z)
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the prompt vocabulary beyond closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
- Pre-trained Language Models Can be Fully Zero-Shot Learners [26.60008734311909]
We propose NPPrompt, a nonparametric prompting method for pre-trained language models, for fully zero-shot language understanding.
NPPrompt uses only pre-trained language models and does not require any labeled data or additional raw corpus for further fine-tuning.
We evaluate NPPrompt against previous major few-shot and zero-shot learning methods on diverse NLP tasks.
arXiv Detail & Related papers (2022-12-14T00:03:52Z)
- Test-Time Adaptation via Self-Training with Nearest Neighbor Information [16.346069386394703]
Adapting a trained classifier at deployment using only online, unlabeled test data is an important practical problem.
One of the popular approaches for test-time adaptation is self-training.
We propose a novel test-time adaptation method, Test-time Adaptation via Self-Training with nearest neighbor information.
arXiv Detail & Related papers (2022-07-08T05:02:15Z)
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed Class-Aware Feature Alignment (CAFA), which encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performance on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling readily available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
- Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose a semi-supervised learning (SSL) approach that selects which unlabeled examples to train on using a dynamically adjusted threshold.
Our proposed approach, Dash, is adaptive in how it selects unlabeled data.
arXiv Detail & Related papers (2021-09-01T23:52:29Z)
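The dynamic-thresholding idea behind Dash can be illustrated generically; the sketch below is a simplification (not Dash's exact loss or schedule) in which only unlabeled examples whose pseudo-label loss falls below a shrinking threshold are kept for training.

```python
# Generic illustration of dynamic-threshold selection for semi-supervised
# learning (a simplification, not Dash's exact schedule): keep unlabeled
# examples whose pseudo-label loss is below a threshold that decays over time.
import torch
import torch.nn.functional as F

def select_unlabeled(model, unlabeled_batch, step, rho0=2.0, gamma=1.05):
    """Return the subset of `unlabeled_batch` to train on at this step."""
    threshold = rho0 * gamma ** (-step)        # threshold tightens as training proceeds
    with torch.no_grad():
        logits = model(unlabeled_batch)        # assumes the model returns class logits
        pseudo_labels = logits.argmax(dim=-1)  # the model's own predictions
        losses = F.cross_entropy(logits, pseudo_labels, reduction="none")
    keep = losses < threshold                  # only confidently fit examples survive
    return unlabeled_batch[keep], pseudo_labels[keep]
```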
- TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks [14.547623982073475]
Deep learning systems are notoriously difficult to test and debug.
To reduce labeling cost, it is essential to select and label only high-quality, bug-revealing test inputs.
We propose TestRank, a novel test prioritization technique that orders unlabeled test instances according to their bug-revealing capability.
arXiv Detail & Related papers (2021-05-21T03:41:10Z) - Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, using only 20-30 labeled samples per class per task for training and validation, perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
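Several of the entries above (RNT, Debiased, and this uncertainty-aware method) revolve around the same basic self-training loop: fit a classifier on the labeled seed set, pseudo-label the unlabeled text, keep only the examples the model is sufficiently sure about, and refit. A minimal, generic version of that loop, using plain softmax confidence in place of the more careful uncertainty estimates and debiasing these papers propose, might look like this:

```python
# Minimal, generic self-training loop with confidence-based selection.
# The papers above refine this with ranking, debiasing, or uncertainty
# estimates; this sketch uses a plain softmax-confidence cutoff.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(labeled_texts, labels, unlabeled_texts, rounds=3, confidence=0.9):
    texts, y = list(labeled_texts), list(labels)
    pool = list(unlabeled_texts)
    for _ in range(rounds):
        vectorizer = TfidfVectorizer().fit(texts + pool)
        clf = LogisticRegression(max_iter=1000).fit(vectorizer.transform(texts), y)
        if not pool:
            break
        probs = clf.predict_proba(vectorizer.transform(pool))
        confident = probs.max(axis=1) >= confidence
        # Promote confidently pseudo-labeled examples into the training set.
        texts += [t for t, keep in zip(pool, confident) if keep]
        y += list(clf.classes_[probs[confident].argmax(axis=1)])
        pool = [t for t, keep in zip(pool, confident) if not keep]
    return clf, vectorizer
```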