Annotation Imputation to Individualize Predictions: Initial Studies on
Distribution Dynamics and Model Predictions
- URL: http://arxiv.org/abs/2305.15070v3
- Date: Thu, 5 Oct 2023 07:10:25 GMT
- Title: Annotation Imputation to Individualize Predictions: Initial Studies on
Distribution Dynamics and Model Predictions
- Authors: London Lowmanstone, Ruyuan Wan, Risako Owan, Jaehyung Kim, Dongyeop
Kang
- Abstract summary: We propose using imputation methods to generate the opinions of all annotators for all examples.
We then train and prompt models, using data from the imputed dataset, to make predictions about the distribution of responses and individual annotations.
- Score: 20.74423180342303
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Annotating data via crowdsourcing is time-consuming and expensive. Due to
these costs, dataset creators often have each annotator label only a small
subset of the data. This leads to sparse datasets with examples that are marked
by few annotators. The downside of this process is that if an annotator doesn't
get to label a particular example, their perspective on it is missed. This is
especially concerning for subjective NLP datasets where there is no single
correct label: people may have different valid opinions. Thus, we propose using
imputation methods to generate the opinions of all annotators for all examples,
creating a dataset that does not leave out any annotator's view. We then train
and prompt models, using data from the imputed dataset, to make predictions
about the distribution of responses and individual annotations.
In our analysis of the results, we found that the choice of imputation method
significantly impacts soft label changes and distribution. While the imputation
introduces noise in the prediction of the original dataset, it has shown
potential in enhancing shots for prompts, particularly for low-response-rate
annotators. We have made all of our code and data publicly available.
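The imputation idea in the abstract can be illustrated with a minimal sketch. It assumes a small annotator-by-example matrix in which `NaN` marks a missing annotation, and it uses regularized alternating least squares as a stand-in imputation method; this is not necessarily one of the methods evaluated in the paper. The function names `impute_annotations` and `soft_labels` and the toy data are hypothetical.

```python
import numpy as np


def impute_annotations(ratings, rank=2, n_iters=50, reg=0.1, seed=0):
    """Fill NaN entries of an annotator x example matrix with a low-rank fit.

    Illustrative stand-in only: regularized alternating least squares.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(ratings)
    n_annotators, n_examples = ratings.shape
    A = rng.normal(scale=0.1, size=(n_annotators, rank))  # annotator factors
    E = rng.normal(scale=0.1, size=(n_examples, rank))    # example factors
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Update each annotator's factors from the examples they labeled.
        for i in range(n_annotators):
            cols = observed[i]
            if cols.any():
                Ei = E[cols]
                A[i] = np.linalg.solve(Ei.T @ Ei + ridge, Ei.T @ ratings[i, cols])
        # Update each example's factors from the annotators who labeled it.
        for j in range(n_examples):
            rows = observed[:, j]
            if rows.any():
                Aj = A[rows]
                E[j] = np.linalg.solve(Aj.T @ Aj + ridge, Aj.T @ ratings[rows, j])
    predicted = A @ E.T
    # Keep the original annotations; only fill in the missing cells.
    return np.where(observed, ratings, predicted)


def soft_labels(labels, n_classes):
    """Per-example label distribution; NaN cells are ignored.

    Assumes every example has at least one non-missing label.
    """
    dist = np.zeros((labels.shape[1], n_classes))
    for j in range(labels.shape[1]):
        col = labels[:, j]
        col = col[~np.isnan(col)].astype(int)
        dist[j] = np.bincount(col, minlength=n_classes) / len(col)
    return dist


nan = np.nan
# Toy data: 4 annotators, 5 examples, labels in {0, 1, 2}.
ratings = np.array([
    [0, 1, nan, 2, nan],
    [nan, 1, 2, nan, 0],
    [0, nan, 2, 2, nan],
    [1, 1, nan, nan, 0],
], dtype=float)

# Round the continuous predictions back onto the label set.
imputed = np.clip(np.rint(impute_annotations(ratings)), 0, 2)
original_soft = soft_labels(ratings, 3)
imputed_soft = soft_labels(imputed, 3)
```

Comparing `original_soft` with `imputed_soft` row by row shows how much a given imputation method shifts each example's soft label, which is the kind of distribution change the abstract refers to.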
Related papers
- From Random to Informed Data Selection: A Diversity-Based Approach to
Optimize Human Annotation and Few-Shot Learning [38.30983556062276]
A major challenge in Natural Language Processing is obtaining annotated data for supervised learning.
Crowdsourcing introduces issues related to the annotator's experience, consistency, and biases.
This paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning.
arXiv Detail & Related papers (2024-01-24T04:57:32Z)
- Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks [9.110872603799839]
Supervised classification heavily depends on datasets annotated by humans.
In subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters.
In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks.
arXiv Detail & Related papers (2023-11-16T10:18:32Z)
- IDEAL: Influence-Driven Selective Annotations Empower In-Context
Learners in Large Language Models [66.32043210237768]
This paper introduces an influence-driven selective annotation method.
It aims to minimize annotation costs while improving the quality of in-context examples.
Experiments confirm the superiority of the proposed method on various benchmarks.
arXiv Detail & Related papers (2023-10-16T22:53:54Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective
Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Are labels informative in semi-supervised learning? -- Estimating and
leveraging the missing-data mechanism [4.675583319625962]
Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models.
It can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others.
We propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm.
arXiv Detail & Related papers (2023-02-15T09:18:46Z)
- SeedBERT: Recovering Annotator Rating Distributions from an Aggregated
Label [43.23903984174963]
We propose SeedBERT, a method for recovering annotator rating distributions from a single label.
Our human evaluations indicate that SeedBERT's attention mechanism is consistent with human sources of annotator disagreement.
arXiv Detail & Related papers (2022-11-23T18:35:15Z)
- Efficient Few-Shot Fine-Tuning for Opinion Summarization [83.76460801568092]
Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples.
We show that a few-shot method based on adapters can easily store in-domain knowledge.
We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets.
arXiv Detail & Related papers (2022-05-04T16:38:37Z)
- Active Learning by Feature Mixing [52.16150629234465]
We propose a novel method for batch active learning called ALFA-Mix.
We identify unlabelled instances with sufficiently-distinct features by seeking inconsistencies in predictions.
We show that inconsistencies in these predictions help discover features that the model is unable to recognise in the unlabelled instances.
arXiv Detail & Related papers (2022-03-14T12:20:54Z)
- Multi-label Classification with Partial Annotations using Class-aware
Selective Loss [14.3159150577502]
Large-scale multi-label classification datasets are commonly partially annotated.
We analyze the partial labeling problem, then propose a solution based on two key ideas.
With our novel approach, we achieve state-of-the-art results on the OpenImages dataset.
arXiv Detail & Related papers (2021-10-21T08:10:55Z)
- Learning with Noisy Labels by Targeted Relabeling [52.0329205268734]
Crowdsourcing platforms are often used to collect datasets for training deep neural networks.
We propose an approach which reserves a fraction of annotations to explicitly relabel highly probable labeling errors.
arXiv Detail & Related papers (2021-10-15T20:37:29Z)
- Instance Correction for Learning with Open-set Noisy Labels [145.06552420999986]
We use the sample selection approach to handle open-set noisy labels.
The discarded data are seen to be mislabeled and do not participate in training.
We modify the instances of discarded data to make predictions for the discarded data consistent with given labels.
arXiv Detail & Related papers (2021-06-01T13:05:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.