Related papers: Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science

URL: http://arxiv.org/abs/2305.14310v3
Date: Sun, 24 Mar 2024 18:03:10 GMT
Title: Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science
Authors: Yida Mu, Ben P. Wu, William Thorne, Ambrose Robinson, Nikolaos Aletras, Carolina Scarton, Kalina Bontcheva, Xingyi Song
Abstract summary: We evaluate the zero-shot performance of two publicly accessible Large Language Models, ChatGPT and OpenAssistant. We find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10%.
Score: 27.727207443432278
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands associated with training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, in the context of six Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiments investigate the impact of prompt complexity, including the effect of incorporating label definitions into the prompt; use of synonyms for label names; and the influence of integrating past memories during foundation model training. The findings indicate that in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10%.

Related papers

CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs [81.79228604962687]
This work investigates whether small-scale LMs can benefit from instruction tuning. We compare conversational and question-answering instruction tuning datasets, applied either in a merged or sequential curriculum. Results show that instruction tuning yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data. However, improvements do not consistently transfer to zero-shot tasks, suggesting a trade-off between interaction-focused adaptation and broad linguistic generalization.
arXiv Detail & Related papers (2025-10-29T10:36:39Z)
Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity [0.764671395172401]
Recent Vision-Language Models (VLMs) enable zero-shot classification by aligning images and text in a shared space. This study investigates how prompt specificity affects the zero-shot classification of sitting, standing, and walking/running.
arXiv Detail & Related papers (2025-10-15T09:53:46Z)
Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings [9.763273544617176]
Large language models (LLMs) have demonstrated remarkable capabilities in handling complex dialogue tasks without requiring use case-specific fine-tuning. In this paper, we introduce a simple yet effective framework to address this challenge. Our approach is specifically designed for per-utterance classification problems, which encompass tasks such as intent detection, dialogue state tracking, and more.
arXiv Detail & Related papers (2025-03-07T17:46:13Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and 'AND'/'OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts [5.397565689903148]
We introduce the ZIP score (Zero-shot Importance of Perturbation score), a versatile metric applicable to both open and closed-source models. We show that while both 'step-by-step' and 'think' show high ZIP scores, which one is more influential depends on the model and task.
arXiv Detail & Related papers (2025-02-05T18:04:29Z)
Description Boosting for Zero-Shot Entity and Relation Classification [5.8959034854546815]
We show that Zero-Shot Learning (ZSL) methods are sensitive to provided textual descriptions of entities (or relations). We propose a strategy for generating variations of an initial description and an ensemble method capable of boosting the predictions of zero-shot models through description enhancement.
arXiv Detail & Related papers (2024-06-04T12:09:44Z)
Enabling Natural Zero-Shot Prompting on Encoder Models via Statement-Tuning [55.265138447400744]
Statement-Tuning is a technique that models discriminative tasks as a set of finite statements and trains an encoder model to discriminate between the potential statements to determine the label. Experimental results demonstrate that Statement-Tuning achieves competitive performance compared to state-of-the-art LLMs with significantly fewer parameters. The study investigates the impact of several design choices on few-shot and zero-shot generalization, revealing that Statement-Tuning can achieve strong performance with modest training data.
arXiv Detail & Related papers (2024-04-19T14:05:03Z)
The language of prompting: What linguistic properties make a prompt successful? [13.034603322224548]
LLMs can be prompted to achieve impressive zero-shot or few-shot performance in many NLP tasks. Yet, we still lack a systematic understanding of how linguistic properties of prompts correlate with task performance. We investigate both grammatical properties such as mood, tense, aspect and modality, as well as lexico-semantic variation through the use of synonyms.
arXiv Detail & Related papers (2023-11-03T15:03:36Z)
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting [68.19544657508509]
Large language models (LLMs) are adopted as a fundamental component of language technologies. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings. We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights.
arXiv Detail & Related papers (2023-10-17T15:03:30Z)
Investigating the Limitation of CLIP Models: The Worst-Performing Categories [53.360239882501325]
Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts. It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts. However, we found that their performance in the worst categories is significantly inferior to the overall performance.
arXiv Detail & Related papers (2023-10-05T05:37:33Z)
Leveraging Codebook Knowledge with NLI and ChatGPT for Zero-Shot Political Relation Classification [10.896514317144499]
This study evaluates zero-shot learning methods that use expert knowledge from an existing codebook and a natural language inference (NLI)-based model called ZSP. Experiments reveal ChatGPT's strengths and limitations, and crucially show that ZSP outperforms dictionary-based methods. Our study underscores the efficacy of leveraging transfer learning and existing domain expertise to enhance research efficiency and scalability.
arXiv Detail & Related papers (2023-08-15T16:41:53Z)
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial. We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
EXnet: Efficient In-context Learning for Data-less Text classification [0.0]
We present EXnet, a model specifically designed to perform in-context learning without limitations on the number of examples. We argue that in-context learning is an effective method to increase task accuracy, and providing examples facilitates cross-task generalization. With extensive experiments, we show that even our smallest model (15M parameters) generalizes to several unseen classification tasks and domains.
arXiv Detail & Related papers (2023-05-24T01:40:57Z)
M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning). It introduces open words from the WordNet to extend the range of words forming the prompt texts from only closed-set label words to more, and thus prompts are tuned in a simulated open-set scenario. Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting. LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available. LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.