ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
- URL: http://arxiv.org/abs/2410.12405v1
- Date: Wed, 16 Oct 2024 09:38:13 GMT
- Title: ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
- Authors: Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, Kai Chen
- Abstract summary: ProSA is a framework designed to evaluate and comprehend prompt sensitivity in large language models.
Our study uncovers that prompt sensitivity fluctuates across datasets and models, with larger models exhibiting enhanced robustness.
- Score: 72.13489820420726
- Abstract: Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but their performance is highly sensitive to the prompts used. This variability poses challenges for accurate assessment and user satisfaction. Current research frequently overlooks instance-level prompt variations and their implications for subjective evaluations. To address these shortcomings, we introduce ProSA, a framework designed to evaluate and comprehend prompt sensitivity in LLMs. ProSA incorporates a novel sensitivity metric, PromptSensiScore, and leverages decoding confidence to elucidate underlying mechanisms. Our extensive study, spanning multiple tasks, uncovers that prompt sensitivity fluctuates across datasets and models, with larger models exhibiting enhanced robustness. We observe that few-shot examples can alleviate this sensitivity issue, and that subjective evaluations are also susceptible to prompt sensitivity, particularly in complex, reasoning-oriented tasks. Furthermore, our findings indicate that higher model confidence correlates with increased prompt robustness. We believe this work will serve as a helpful tool in studying the prompt sensitivity of LLMs. The project is released at https://github.com/open-compass/ProSA.
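The abstract names PromptSensiScore but does not define it here. Below is a minimal sketch of one plausible instance-level formulation, in which sensitivity is the spread of a model's score across semantically equivalent prompt variants, averaged over instances; the variant set, `eval_fn`, and the max-minus-min aggregation are assumptions for illustration, not ProSA's published definition.

```python
from statistics import mean

def prompt_sensi_score(instances, prompt_variants, eval_fn):
    """Hypothetical instance-level sensitivity score (an assumption for
    illustration, not ProSA's published definition of PromptSensiScore).

    instances:       task instances to evaluate
    prompt_variants: functions mapping an instance to one prompt phrasing
    eval_fn:         returns 1.0 if the model answers a prompt correctly, else 0.0
    """
    per_instance = []
    for x in instances:
        scores = [eval_fn(variant(x)) for variant in prompt_variants]
        # Spread between the best and worst phrasing for this single instance.
        per_instance.append(max(scores) - min(scores))
    return mean(per_instance)  # 0.0 = fully robust, 1.0 = maximally sensitive
```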
Related papers
- POSIX: A Prompt Sensitivity Index For Large Language Models [22.288479270814484]
Large Language Models (LLMs) are surprisingly sensitive to minor variations in prompts.
POSIX is a novel PrOmpt Sensitivity IndeX, proposed as a reliable measure of prompt sensitivity.
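The summary gives no formula; as a hedged illustration, an index of this kind can be built from how much the model's log-likelihood of a fixed response shifts when the prompt is replaced by an intent-preserving variant. The pairwise averaging below is an assumption, not necessarily POSIX's exact definition.

```python
from itertools import combinations
from statistics import mean

def sensitivity_index(log_likelihood, prompts, response):
    """Hypothetical prompt-sensitivity index (illustrative only, not
    necessarily POSIX's published formula).

    log_likelihood(prompt, response) -> log p(response | prompt) under the model
    prompts: intent-preserving variants of the same prompt
    """
    shifts = [
        abs(log_likelihood(p_i, response) - log_likelihood(p_j, response))
        for p_i, p_j in combinations(prompts, 2)
    ]
    return mean(shifts)  # larger = more sensitive to rephrasing
```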
arXiv Detail & Related papers (2024-10-03T04:01:14Z)
- Do Large Language Models Possess Sensitive to Sentiment? [18.88126980975737]
Large Language Models (LLMs) have recently displayed their extraordinary capabilities in language understanding.
This paper investigates the ability of LLMs to detect and react to sentiment in the text modality.
arXiv Detail & Related papers (2024-09-04T01:40:20Z)
- How Susceptible are LLMs to Influence in Prompts? [6.644673474240519]
Large Language Models (LLMs) are highly sensitive to prompts, including additional context provided therein.
We study how an LLM's response to multiple-choice questions changes when the prompt includes a prediction and explanation from another model.
Our findings reveal that models are strongly influenced and, when explanations are provided, are swayed regardless of the explanation's quality.
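A hedged sketch of the prompt construction such a study implies, appending another model's prediction and explanation to a multiple-choice question; the template and argument names are hypothetical, not the paper's exact wording.

```python
def build_mcq_prompt(question, choices, peer_prediction=None, peer_explanation=None):
    """Assemble a multiple-choice prompt, optionally injecting another
    model's prediction and explanation (hypothetical template)."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", choices)]
    if peer_prediction is not None:
        lines.append(f"Another model answered: {peer_prediction}")
    if peer_explanation is not None:
        lines.append(f"Its explanation: {peer_explanation}")
    lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)
```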
arXiv Detail & Related papers (2024-08-17T17:40:52Z)
- Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models [55.332004960574004]
Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established.
This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt.
We propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty.
arXiv Detail & Related papers (2024-07-20T11:19:58Z)
- MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries? [70.77691645678804]
Humans are prone to cognitive distortions -- biased thinking patterns that lead to exaggerated responses to specific stimuli.
This paper demonstrates that advanced Multimodal Large Language Models (MLLMs) exhibit similar tendencies.
We identify three types of stimuli that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation.
arXiv Detail & Related papers (2024-06-22T23:26:07Z)
- On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
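A minimal sketch of the case-level bookkeeping a worst-prompt benchmark implies: score every paraphrase of each query, then report worst-, average-, and best-case performance. The input layout and names here are assumptions.

```python
from statistics import mean

def case_level_performance(scores_per_query):
    """scores_per_query: {query_id: [score for each semantically equivalent
    paraphrase of that query]} -- a hypothetical input layout.

    Returns worst-, average-, and best-case performance across paraphrases,
    averaged over queries.
    """
    groups = list(scores_per_query.values())
    return {
        "worst":   mean(min(scores) for scores in groups),
        "average": mean(mean(scores) for scores in groups),
        "best":    mean(max(scores) for scores in groups),
    }
```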
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
- LLMs for Targeted Sentiment in News Headlines: Exploring the Descriptive-Prescriptive Dilemma [0.0]
This paper compares the accuracy of state-of-the-art LLMs and fine-tuned encoder models for targeted sentiment analysis of news headlines.
We analyze how performance is affected by prompt prescriptiveness, ranging from plain zero-shot to elaborate few-shot prompts.
We find that LLMs outperform fine-tuned encoders on descriptive datasets, while calibration and F1-score generally improve with increased prescriptiveness.
arXiv Detail & Related papers (2024-03-01T10:10:34Z)
- How are Prompts Different in Terms of Sensitivity? [50.67313477651395]
We present a comprehensive prompt analysis based on the sensitivity of a function.
We use gradient-based saliency scores to empirically demonstrate how different prompts affect the relevance of input tokens to the output.
We introduce sensitivity-aware decoding, which incorporates sensitivity estimation as a penalty term in standard greedy decoding.
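A hedged sketch of what such a penalty inside greedy decoding could look like; the sensitivity estimator itself is the paper's contribution and is treated as a black box here, and the linear penalty with weight `alpha` is an assumption.

```python
import numpy as np

def sensitivity_aware_greedy_step(logits, sensitivity, alpha=1.0):
    """Choose the next token greedily after subtracting a sensitivity penalty.

    logits:      (vocab_size,) raw next-token scores
    sensitivity: (vocab_size,) per-candidate sensitivity estimates
                 (how these are computed is treated as a black box here)
    alpha:       penalty weight -- an assumed hyperparameter
    """
    return int(np.argmax(logits - alpha * sensitivity))
```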
arXiv Detail & Related papers (2023-11-13T10:52:01Z)
- Balancing Robustness and Sensitivity using Feature Contrastive Learning [95.86909855412601]
Methods that promote robustness can hurt the model's sensitivity to rare or underrepresented patterns.
We propose Feature Contrastive Learning (FCL) that encourages a model to be more sensitive to the features that have higher contextual utility.
arXiv Detail & Related papers (2021-05-19T20:53:02Z)
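The summary does not give FCL's objective; as a loosely hedged illustration, a contrastive loss can be made feature-aware by weighting each feature's contribution to the distance by an estimated contextual utility. This triplet form and weighting scheme are assumptions, not the paper's formulation.

```python
import numpy as np

def utility_weighted_triplet_loss(anchor, positive, negative, utility, margin=1.0):
    """Triplet contrastive loss with per-feature utility weights
    (an illustrative stand-in, not the paper's FCL objective).

    anchor, positive, negative: (d,) feature vectors
    utility: (d,) nonnegative weights; higher = more contextually useful,
             so distances along those features count more
    """
    def dist(a, b):
        return np.sqrt(np.sum(utility * (a - b) ** 2))

    # Pull the positive closer than the negative by at least `margin`,
    # measured in the utility-weighted metric.
    return float(max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin))
```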