Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts
- URL: http://arxiv.org/abs/2502.03418v2
- Date: Sat, 15 Feb 2025 20:06:21 GMT
- Title: Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts
- Authors: Nikta Gohari Sadr, Sangmitra Madhusudan, Ali Emami
- Abstract summary: We introduce the ZIP score (Zero-shot Importance of Perturbation score), a versatile metric applicable to both open and closed-source models.
We show that while both 'step-by-step' and 'think' show high ZIP scores, which one is more influential depends on the model and task.
- Score: 5.397565689903148
- Abstract: Zero-shot prompting techniques have significantly improved the performance of Large Language Models (LLMs). However, we lack a clear understanding of why zero-shot prompts are so effective. For example, in the prompt "Let's think step-by-step," is "think" or "step-by-step" more crucial to its success? Existing interpretability methods, such as gradient-based and attention-based approaches, are computationally intensive and restricted to open-source models. We introduce the ZIP score (Zero-shot Importance of Perturbation score), a versatile metric applicable to both open and closed-source models, based on systematic input word perturbations. Our experiments across four recent LLMs, seven widely-used prompts, and several tasks, reveal interesting patterns in word importance. For instance, while both 'step-by-step' and 'think' show high ZIP scores, which one is more influential depends on the model and task. We validate our method using controlled experiments and compare our results with human judgments, finding that proprietary models align more closely with human intuition regarding word significance. These findings enhance our understanding of LLM behavior and contribute to developing more effective zero-shot prompts and improved model analysis.
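The abstract states the recipe only at a high level: systematically perturb the words of a zero-shot prompt and measure how much the model's output changes. The sketch below is a minimal illustration of that idea, under assumptions not taken from the paper: word deletion as the perturbation, a sampled answer distribution as the output signal, and total-variation distance as the divergence; `sample_answers` stands in for any open- or closed-source model API.

```python
# Minimal sketch of perturbation-based word importance in the spirit of the
# ZIP score. Assumptions (not the paper's definitions): the perturbation is
# word deletion, the model is sampled several times per prompt, and the
# shift is total-variation distance between sampled answer distributions.
from collections import Counter
from typing import Callable

def answer_shift(base: Counter, perturbed: Counter) -> float:
    """Total-variation distance between two sampled answer distributions."""
    keys = set(base) | set(perturbed)
    b, p = sum(base.values()), sum(perturbed.values())
    return 0.5 * sum(abs(base[k] / b - perturbed[k] / p) for k in keys)

def zip_like_scores(prompt: str, task_input: str,
                    sample_answers: Callable[[str], Counter]) -> list[tuple[str, float]]:
    """Delete each prompt word in turn and score it by the output shift it causes."""
    words = prompt.split()
    base = sample_answers(f"{prompt}\n{task_input}")
    scores = []
    for i, word in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        scores.append((word, answer_shift(base, sample_answers(f"{ablated}\n{task_input}"))))
    return scores

# e.g. zip_like_scores("Let's think step-by-step.", "What is 17 * 24?", my_model)
```

Because the procedure needs only model outputs, not gradients or attention maps, it applies equally to closed-source APIs, which is the property the abstract highlights.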
Related papers
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference.
This paper presents the first comprehensive study on the prevalent issue of overthinking in these models.
We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
- Large Language Models are Contrastive Reasoners [8.427805316635318]
We show how contrastive prompting significantly improves the ability of large language models to perform complex reasoning.
Our method surpasses zero-shot CoT and few-shot CoT in most arithmetic and commonsense reasoning tasks.
arXiv Detail & Related papers (2024-03-13T03:15:05Z)
- A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models [35.17291316942284]
We propose a novel zero-shot document ranking approach based on Large Language Models (LLMs): the Setwise prompting approach.
Our approach complements existing prompting approaches for LLM-based zero-shot ranking: Pointwise, Pairwise, and Listwise.
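The abstract gives the idea but not the template; the sketch below shows one plausible setwise comparison step, in which the model judges a small set of passages at once and the winner can drive a selection-style sort. The prompt wording and the `llm` callable are illustrative assumptions, not the paper's exact setup.

```python
# One plausible setwise comparison step (the prompt template is an
# assumption, not the paper's): the LLM judges a small set of passages at
# once instead of one (pointwise) or two (pairwise) at a time.
from typing import Callable, Sequence

def setwise_pick(query: str, docs: Sequence[str], llm: Callable[[str], str]) -> int:
    """Return the index of the document the LLM deems most relevant."""
    labels = [chr(ord("A") + i) for i in range(len(docs))]
    listing = "\n".join(f"[{label}] {doc}" for label, doc in zip(labels, docs))
    prompt = (f"Query: {query}\n\nPassages:\n{listing}\n\n"
              "Which passage is most relevant to the query? Answer with the label only.")
    answer = llm(prompt).strip().lstrip("[")
    return labels.index(answer[0].upper())

# Repeated calls on the remaining candidates (a selection sort) yield a full
# ranking with far fewer LLM calls than pairwise comparisons require.
```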
arXiv Detail & Related papers (2023-10-14T05:20:02Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science [27.727207443432278]
We evaluate the zero-shot performance of two publicly accessible Large Language Models, ChatGPT and OpenAssistant.
We find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10%.
arXiv Detail & Related papers (2023-05-23T17:48:21Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
- Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations [97.41375480696972]
We introduce Z-ICL, a new zero-shot method that closes the gap by constructing pseudo-demonstrations for a given test input.
Evaluation on nine classification datasets shows that Z-ICL outperforms previous zero-shot methods by a significant margin.
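As a rough illustration of the pseudo-demonstration idea, the sketch below retrieves corpus sentences similar to the test input and pairs them with randomly drawn labels to mimic in-context examples; the toy word-overlap retrieval and the prompt format are assumptions of this sketch, not the paper's construction.

```python
# Rough illustration of pseudo-demonstrations: pair retrieved neighbors of
# the test input with random labels to mimic in-context examples. The
# word-overlap retrieval and prompt format are assumptions of this sketch.
import random

def build_pseudo_demos(test_input: str, corpus: list[str],
                       labels: list[str], k: int = 4) -> str:
    test_words = set(test_input.lower().split())
    # Toy similarity: word overlap (a real system would use dense retrieval
    # over a large raw corpus).
    neighbors = sorted(corpus,
                       key=lambda s: len(set(s.lower().split()) & test_words),
                       reverse=True)[:k]
    demos = "\n".join(f"Input: {s}\nLabel: {random.choice(labels)}" for s in neighbors)
    return f"{demos}\nInput: {test_input}\nLabel:"
```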
arXiv Detail & Related papers (2022-12-19T21:34:26Z)
- Large Language Models are Zero-Shot Reasoners [28.6899375595088]
Chain of thought (CoT) prompting is a technique for eliciting complex multi-step reasoning through step-by-step answer examples.
We show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer.
Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances.
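In practice Zero-shot-CoT is typically run in two stages: the trigger phrase first elicits a free-form reasoning chain, and a second call extracts the final answer. A minimal sketch, with `llm` standing in for any text-completion API:

```python
# Minimal sketch of the two-stage Zero-shot-CoT procedure: stage 1 elicits
# a reasoning chain with the trigger phrase, stage 2 extracts the answer.
# `llm` stands in for any text-completion API.
from typing import Callable

def zero_shot_cot(question: str, llm: Callable[[str], str]) -> str:
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = llm(reasoning_prompt)                      # stage 1: free-form reasoning
    extraction_prompt = f"{reasoning_prompt}{reasoning}\nTherefore, the answer is"
    return llm(extraction_prompt).strip()                  # stage 2: answer extraction
```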
arXiv Detail & Related papers (2022-05-24T09:22:26Z)
- What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks? [59.51457877578138]
We study contrastive learning on the wearable-based activity recognition task.
This paper presents an open-source PyTorch library, CL-HAR, which can serve as a practical tool for researchers.
arXiv Detail & Related papers (2022-02-12T06:10:15Z)
- CLUES: Few-Shot Learning Evaluation in Natural Language Understanding [81.63968985419982]
We introduce CLUES, a benchmark for evaluating the few-shot learning capabilities of NLU models.
We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks.
arXiv Detail & Related papers (2021-11-04T00:43:15Z)
- Auditing and Debugging Deep Learning Models via Decision Boundaries: Individual-level and Group-level Analysis [0.0]
We use flip points to explain, audit, and debug deep learning models.
A flip point is any point that lies on the boundary between two output classes.
We demonstrate our methods by investigating several models trained on standard datasets used in social applications of machine learning.
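For a concrete picture of a flip point, the sketch below bisects the straight segment between two differently classified inputs until the predicted label changes. This is an illustrative simplification: the paper treats flip points more generally, and `model` here is just any label-returning callable.

```python
# Illustrative simplification: locate a flip point on the segment between
# two inputs that receive different labels, by bisection. The paper works
# with flip points more generally; `model` is any label-returning callable.
import numpy as np

def flip_point(x0: np.ndarray, x1: np.ndarray, model, tol: float = 1e-6) -> np.ndarray:
    c0 = model(x0)
    assert model(x1) != c0, "endpoints must lie in different output classes"
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if model((1 - mid) * x0 + mid * x1) == c0:
            lo = mid          # midpoint still on x0's side of the boundary
        else:
            hi = mid          # midpoint has crossed the boundary
    return (1 - hi) * x0 + hi * x1   # approximate point on the decision boundary
```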
arXiv Detail & Related papers (2020-01-03T01:45:36Z)