Prompt Stability Scoring for Text Annotation with Large Language Models
- URL: http://arxiv.org/abs/2407.02039v1
- Date: Tue, 2 Jul 2024 08:11:18 GMT
- Title: Prompt Stability Scoring for Text Annotation with Large Language Models
- Authors: Christopher Barrie, Elli Palaiologou, Petter Törnberg
- Abstract summary: Researchers are increasingly using language models (LMs) for text annotation.
These approaches rely only on a prompt telling the model to return a given output according to a set of instructions.
This calls into question the replicability of classification routines.
To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability."
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad-hoc and task specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package PromptStability for its estimation. Using six different datasets and twelve outcomes, we classify >150k rows of data to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.
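For intuition only (a toy sketch, not the PromptStability package's API or the paper's exact estimator): prompt stability can be diagnosed by treating each repeated run of a prompt, or each semantically similar paraphrase of it, as a separate "coder" and scoring agreement with a standard reliability coefficient such as Krippendorff's alpha. The example below uses the third-party `krippendorff` Python package and invented toy labels; all values and variable names are illustrative assumptions.

```python
# Toy sketch of intra-/inter-prompt reliability scoring.
# Rows = repeated runs (or prompt paraphrases), columns = documents;
# the label values are invented stand-ins for LM annotations.
import numpy as np
import krippendorff  # pip install krippendorff

# Within-prompt stability: the same prompt run 4 times over 6 documents.
same_prompt_runs = np.array([
    [1, 0, 2, 1, 0, 2],
    [1, 0, 2, 1, 0, 2],
    [1, 0, 2, 1, 1, 2],   # one run flips its label for document 5
    [1, 0, 2, 1, 0, 2],
])

# Between-prompt stability: three semantically similar paraphrases.
paraphrase_runs = np.array([
    [1, 0, 2, 1, 0, 2],
    [1, 0, 2, 2, 0, 2],   # paraphrase 2 flips document 4
    [1, 0, 1, 1, 0, 2],   # paraphrase 3 flips document 3
])

# Treat each run/paraphrase as a "coder" and score agreement with
# Krippendorff's alpha (1.0 = perfect agreement, <= 0 = chance level).
intra = krippendorff.alpha(reliability_data=same_prompt_runs,
                           level_of_measurement="nominal")
inter = krippendorff.alpha(reliability_data=paraphrase_runs,
                           level_of_measurement="nominal")
print(f"within-prompt stability ~ {intra:.2f}")
print(f"between-prompt stability ~ {inter:.2f}")
```

A between-prompt score that is clearly lower than the within-prompt score would suggest the classifications depend on the particular wording of the prompt rather than on the instructions themselves.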
Related papers
- PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation [22.650575388026752]
Large language models (LLMs) have revolutionized NLP research.
In-context learning enables their use as evaluation metrics for natural language generation.
We evaluate more than 720 prompt templates for open-source LLM-based metrics on machine translation (MT) and summarization datasets.
arXiv Detail & Related papers (2024-06-26T17:56:29Z) - StablePT: Towards Stable Prompting for Few-shot Learning via Input Separation [14.341806875791288]
StablePT outperforms state-of-the-art methods by 6.97% in accuracy and reduces the standard deviation by 1.92 on average.
Tests underscore its robustness and stability across 8 datasets covering various tasks.
arXiv Detail & Related papers (2024-04-30T08:01:49Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - Boosted Prompt Ensembles for Large Language Models [38.402161594793775]
Methods such as chain-of-thought prompting and self-consistency have pushed the frontier of language model reasoning performance with no additional training.
We propose a prompt ensembling method for large language models, which uses a small dataset to construct a set of few-shot prompts that together comprise a "boosted prompt ensemble."
We show that this outperforms single-prompt output-space ensembles and bagged prompt-space ensembles on the GSM8k and AQuA datasets.
arXiv Detail & Related papers (2023-04-12T16:47:15Z) - Evaluating the Robustness of Discrete Prompts [27.919548466481583]
We conduct a systematic study of the robustness of discrete prompts.
We measure their performance in two Natural Language Inference (NLI) datasets.
Our results show that although discrete prompt-based methods remain relatively robust against perturbations to NLI inputs, they are highly sensitive to other types of perturbations such as shuffling and deletion of prompt tokens.
arXiv Detail & Related papers (2023-02-11T07:01:53Z) - Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z) - Explaining Patterns in Data with Language Models via Interpretable Autoprompting [143.4162028260874]
We introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data.
iPrompt can yield meaningful insights by accurately finding ground-truth dataset descriptions.
Experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.
arXiv Detail & Related papers (2022-10-04T18:32:14Z) - RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning [84.75064077323098]
This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL).
RLPrompt is flexibly applicable to different types of LMs, such as masked models (e.g., BERT) and left-to-right models (e.g., GPTs).
Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods.
arXiv Detail & Related papers (2022-05-25T07:50:31Z) - Prompt Consistency for Zero-Shot Task Generalization [118.81196556175797]
In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance.
Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency.
Our approach outperforms the state-of-the-art zero-shot learner, T0, on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy.
arXiv Detail & Related papers (2022-04-29T19:18:37Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.