Prompt Stability Scoring for Text Annotation with Large Language Models
- URL: http://arxiv.org/abs/2407.02039v1
- Date: Tue, 2 Jul 2024 08:11:18 GMT
- Title: Prompt Stability Scoring for Text Annotation with Large Language Models
- Authors: Christopher Barrie, Elli Palaiologou, Petter Törnberg
- Abstract summary: Researchers are increasingly using language models (LMs) for text annotation.
These approaches rely only on a prompt telling the model to return a given output according to a set of instructions.
This calls into question the replicability of classification routines.
To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability."
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad-hoc and task specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package PromptStability for its estimation. Using six different datasets and twelve outcomes, we classify >150k rows of data to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.
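For intuition only (a toy sketch, not the PromptStability package's API or the paper's exact estimator): prompt stability can be diagnosed by treating each repeated run of a prompt, or each semantically similar paraphrase of it, as a separate "coder" and scoring agreement with a standard reliability coefficient such as Krippendorff's alpha. The example below uses the third-party `krippendorff` Python package and invented toy labels; all values and variable names are illustrative assumptions.

```python
# Toy sketch of intra-/inter-prompt reliability scoring.
# Rows = repeated runs (or prompt paraphrases), columns = documents;
# the label values are invented stand-ins for LM annotations.
import numpy as np
import krippendorff  # pip install krippendorff

# Within-prompt stability: the same prompt run 4 times over 6 documents.
same_prompt_runs = np.array([
    [1, 0, 2, 1, 0, 2],
    [1, 0, 2, 1, 0, 2],
    [1, 0, 2, 1, 1, 2],   # one run flips its label for document 5
    [1, 0, 2, 1, 0, 2],
])

# Between-prompt stability: three semantically similar paraphrases.
paraphrase_runs = np.array([
    [1, 0, 2, 1, 0, 2],
    [1, 0, 2, 2, 0, 2],   # paraphrase 2 flips document 4
    [1, 0, 1, 1, 0, 2],   # paraphrase 3 flips document 3
])

# Treat each run/paraphrase as a "coder" and score agreement with
# Krippendorff's alpha (1.0 = perfect agreement, <= 0 = chance level).
intra = krippendorff.alpha(reliability_data=same_prompt_runs,
                           level_of_measurement="nominal")
inter = krippendorff.alpha(reliability_data=paraphrase_runs,
                           level_of_measurement="nominal")
print(f"within-prompt stability ~ {intra:.2f}")
print(f"between-prompt stability ~ {inter:.2f}")
```

A between-prompt score that is clearly lower than the within-prompt score would suggest the classifications depend on the particular wording of the prompt rather than on the instructions themselves.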
Related papers
- PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation [22.650575388026752]
Large language models (LLMs) have revolutionized NLP research.
In-context learning enables their use as evaluation metrics for natural language generation.
We evaluate more than 720 prompt templates for open-source LLM-based metrics on machine translation (MT) and summarization datasets.
arXiv Detail & Related papers (2024-06-26T17:56:29Z) - StablePT: Towards Stable Prompting for Few-shot Learning via Input Separation [14.341806875791288]
StablePT outperforms state-of-the-art methods by 6.97% in accuracy and reduces the standard deviation by 1.92 on average.
Tests underscore its robustness and stability across 8 datasets covering various tasks.
arXiv Detail & Related papers (2024-04-30T08:01:49Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - Boosted Prompt Ensembles for Large Language Models [38.402161594793775]
Methods such as chain-of-thought prompting and self-consistency have pushed the frontier of language model reasoning performance with no additional training.
We propose a prompt ensembling method for large language models, which uses a small dataset to construct a set of few-shot prompts that together comprise a "boosted prompt ensemble."
We show that this outperforms single-prompt output-space ensembles and bagged prompt-space ensembles on the GSM8k and AQuA datasets.
arXiv Detail & Related papers (2023-04-12T16:47:15Z) - Evaluating the Robustness of Discrete Prompts [27.919548466481583]
We conduct a systematic study of the robustness of discrete prompts.
We measure their performance in two Natural Language Inference (NLI) datasets.
Our results show that although discrete prompt-based methods remain relatively robust against perturbations to NLI inputs, they are highly sensitive to other types of perturbations such as shuffling and deletion of prompt tokens.
arXiv Detail & Related papers (2023-02-11T07:01:53Z) - Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z) - Explaining Patterns in Data with Language Models via Interpretable Autoprompting [143.4162028260874]
We introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data.
iPrompt can yield meaningful insights by accurately finding ground-truth dataset descriptions.
Experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.
arXiv Detail & Related papers (2022-10-04T18:32:14Z) - RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning [84.75064077323098]
This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL).
RLPrompt is flexibly applicable to different types of LMs, such as masked models (e.g., BERT) and left-to-right models (e.g., GPTs).
Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods.
arXiv Detail & Related papers (2022-05-25T07:50:31Z) - Prompt Consistency for Zero-Shot Task Generalization [118.81196556175797]
In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance.
Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency.
Our approach outperforms the state-of-the-art zero-shot learner, T0, on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy.
arXiv Detail & Related papers (2022-04-29T19:18:37Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.