Related papers: Which Words Matter Most in Zero-Shot Prompts?

URL: http://arxiv.org/abs/2502.03418v3
Date: Mon, 29 Sep 2025 16:29:27 GMT
Title: Which Words Matter Most in Zero-Shot Prompts?
Authors: Nikta Gohari Sadr, Sangmitra Madhusudan, Hassan Sajjad, Ali Emami
Abstract summary: ZIP score is the first systematic method to quantify individual word importance in instructional prompts. We show that task-specific word hierarchies exist where mathematical problems prioritize "step-by-step" while reasoning tasks favor "think". We establish the first ground-truth benchmark for prompt interpretability through 20 validation prompts with predetermined key words.
Score: 16.347012287506253
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While zero-shot instructional prompts like "Let's think step-by-step" have revolutionized Large Language Model performance, a fundamental question remains unanswered: which specific words drive their remarkable effectiveness? We introduce the ZIP score (Zero-shot Importance of Perturbation), the first systematic method to quantify individual word importance in instructional prompts through controlled perturbations including synonym replacement, co-hyponym substitution, and strategic removal. Our analysis across four flagship models, seven widely-adopted prompts, and multiple task domains reveals four key findings: (1) Task-specific word hierarchies exist where mathematical problems prioritize "step-by-step" while reasoning tasks favor "think"; (2) Proprietary models show superior alignment with human intuitions compared to open-source alternatives; (3) Nouns dominate importance rankings, consistently representing the majority of significant words; and (4) Word importance inversely correlates with model performance, indicating prompts have greatest impact where models struggle most. Beyond revealing these patterns, we establish the first ground-truth benchmark for prompt interpretability through 20 validation prompts with predetermined key words, where ZIP achieves 90% accuracy versus LIME's 60%. Our findings advance prompt science, the study of how language shapes model behavior, providing both practical insights for prompt engineering and theoretical understanding of word-level effects in LLMs.
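
The perturbation idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' ZIP score: it covers only the word-removal perturbation (no synonym or co-hyponym substitution), and the names `removal_importance` and `toy_score` are hypothetical. In practice the scoring function would be task accuracy obtained from a model; here a toy stand-in keeps the example self-contained.

```python
from typing import Callable, List, Tuple


def removal_importance(
    prompt: str, score_fn: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Rate each word by how much the score changes when that word is removed.

    Simplified illustration only: the real ZIP score also uses synonym and
    co-hyponym perturbations, which are not modeled here.
    """
    words = prompt.split()
    baseline = score_fn(prompt)
    results = []
    for i, word in enumerate(words):
        # Rebuild the prompt with the i-th word left out.
        perturbed = " ".join(words[:i] + words[i + 1:])
        results.append((word, abs(baseline - score_fn(perturbed))))
    return results


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in for model accuracy under a given prompt."""
    words = prompt.split()
    score = 0.0
    if "think" in words:
        score += 0.25
    if "step-by-step" in words:
        score += 0.5
    return score


# Usage: list each word of the prompt with its removal-based importance.
for word, importance in removal_importance("Let's think step-by-step", toy_score):
    print(word, importance)
```

With this toy scorer, "step-by-step" receives the largest importance because removing it changes the score the most. Replacing `toy_score` with a function that evaluates a real model on a benchmark would be needed for any meaningful measurement.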

Related papers

ERU-KG: Efficient Reference-aligned Unsupervised Keyphrase Generation [21.10770048637475]
We propose ERU-KG, an unsupervised keyphrase generation (UKG) model that consists of an informativeness and a phraseness module. ERU-KG demonstrates its effectiveness on keyphrase generation benchmarks by outperforming unsupervised baselines and achieving on average 89% of the performance of a supervised model for top 10 predictions.
arXiv Detail & Related papers (2025-05-30T05:09:53Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning [5.4141465747474475]
Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving problems of moderate complexity. We systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph.
arXiv Detail & Related papers (2025-02-19T20:20:24Z)
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
Large Language Models are Contrastive Reasoners [8.427805316635318]
We show how contrastive prompting significantly improves the ability of large language models to perform complex reasoning. Our method surpasses zero-shot CoT and few-shot CoT in most arithmetic and commonsense reasoning tasks.
arXiv Detail & Related papers (2024-03-13T03:15:05Z)
Word Importance Explains How Prompts Affect Language Model Outputs [0.7223681457195862]
This study presents a method to improve the explainability of large language models by varying individual words in prompts. Unlike classical attention, word importance measures the impact of prompt words on arbitrarily-defined text scores. Results show that word importance scores are closely related to the expected suffix importances for multiple scoring functions.
arXiv Detail & Related papers (2024-03-05T15:04:18Z)
Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models [35.17291316942284]
We propose a novel zero-shot document ranking approach based on Large Language Models (LLMs): the Setwise prompting approach. Our approach complements existing prompting approaches for LLM-based zero-shot ranking: Pointwise, Pairwise, and Listwise.
arXiv Detail & Related papers (2023-10-14T05:20:02Z)
Instruction-following Evaluation through Verbalizer Manipulation [64.73188776428799]
We propose a novel instruction-following evaluation protocol called verbalizer manipulation. It instructs the model to verbalize the task label with words aligning with model priors to different extents. We observe that the instruction-following abilities of models, across different families and scales, are significantly distinguished by their performance on less natural verbalizers.
arXiv Detail & Related papers (2023-07-20T03:54:24Z)
Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners. We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting. Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
Assessing Word Importance Using Models Trained for Semantic Tasks [0.0]
We derive word significance from models trained to solve semantic tasks: Natural Language Inference and Paraphrase Identification. We evaluate their relevance using a so-called cross-task evaluation. Our method can be used to identify important words in sentences without any explicit word importance labeling in training.
arXiv Detail & Related papers (2023-05-31T09:34:26Z)
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial. We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science [27.727207443432278]
We evaluate the zero-shot performance of two publicly accessible Large Language Models, ChatGPT and OpenAssistant. We find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10%.
arXiv Detail & Related papers (2023-05-23T17:48:21Z)
Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations [97.41375480696972]
We introduce Z-ICL, a new zero-shot method that closes the gap by constructing pseudo-demonstrations for a given test input. Evaluation on nine classification datasets shows that Z-ICL outperforms previous zero-shot methods by a significant margin.
arXiv Detail & Related papers (2022-12-19T21:34:26Z)
Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models. We evaluate DIVERSE on the latest language model code-davinci and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z)
Large Language Models are Zero-Shot Reasoners [28.6899375595088]
Chain of thought (CoT) prompting is a technique for eliciting complex multi-step reasoning through step-by-step answer examples. We show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances.
arXiv Detail & Related papers (2022-05-24T09:22:26Z)
CLUES: Few-Shot Learning Evaluation in Natural Language Understanding [81.63968985419982]
We introduce CLUES, a benchmark for evaluating the few-shot learning capabilities of NLU models. We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks.
arXiv Detail & Related papers (2021-11-04T00:43:15Z)
My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone to even common-sense adversarial samples. We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms. We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.