Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology
- URL: http://arxiv.org/abs/2512.03818v1
- Date: Wed, 03 Dec 2025 14:07:42 GMT
- Title: Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology
- Authors: Kylie L. Anglin, Stephanie Milan, Brittney Hernandez, Claudia Ventura,
- Abstract summary: We present an empirical framework for optimizing prompt performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies: codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output (here, the category assigned to a text) depends heavily on the wording of the prompt. While the literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) with zero-shot and few-shot classification. We find that personas, chain-of-thought reasoning, and explanations do not fully offset the performance loss that accompanies a poorly worded prompt. Instead, the most influential features of a prompt are the construct definition, the task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance on a training dataset, validating the final approach on a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
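The recommended procedure maps onto a short script. Below is a minimal Python sketch of that workflow, assuming a hypothetical `classify(prompt, text)` LLM wrapper; the prompt variants, metric, and data splits are whatever the researcher supplies, not anything from the paper itself.

```python
# Minimal sketch of the recommended workflow: score many prompt variants on a
# training set, pick the best, then confirm the winner once on a holdout set.
# `classify` is a hypothetical stand-in for an LLM call, not the authors' code.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, expert label)

def classify(prompt: str, text: str) -> str:
    """Hypothetical LLM wrapper: return a category label for `text`."""
    raise NotImplementedError("Replace with a real LLM API call.")

def accuracy(prompt: str, data: List[Example],
             llm: Callable[[str, str], str]) -> float:
    hits = sum(llm(prompt, text) == label for text, label in data)
    return hits / len(data)

def select_prompt(variants: List[str], train: List[Example],
                  holdout: List[Example],
                  llm: Callable[[str, str], str] = classify) -> Tuple[str, float]:
    # Empirical selection: evaluate every variant on the training data...
    best = max(variants, key=lambda p: accuracy(p, train, llm))
    # ...then validate the chosen prompt once on the held-out set.
    return best, accuracy(best, holdout, llm)
```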
Related papers
- Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation [63.97051732013936]
We propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases. In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes. In the second phase, local search is applied to explore the neighbourhoods of the best-performing programmes.
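A rough Python sketch of the two-phase idea, where a toy slot grammar and a keyword-counting fitness stand in for the paper's grammar-guided genetic programming and real scoring:

```python
# Toy two-phase search in the spirit of the summary: a global evolutionary
# phase over grammar-derived prompts, then local search around the winner.
# The slot grammar and keyword fitness are placeholders, not the paper's system.
import random

GRAMMAR = {
    "role":   ["You are an expert annotator.", "You are a careful reader."],
    "task":   ["Classify the text.", "Assign exactly one label to the text."],
    "format": ["Answer with the label only.", "Respond in one word."],
}

def render(genome: dict) -> str:
    return " ".join(genome[slot] for slot in ("role", "task", "format"))

def mutate(genome: dict) -> dict:
    slot = random.choice(list(GRAMMAR))
    return {**genome, slot: random.choice(GRAMMAR[slot])}

def fitness(prompt: str) -> float:
    """Placeholder: in practice, score the prompt on labeled training data."""
    return float(sum(word in prompt for word in ("expert", "exactly", "only")))

def search(generations: int = 20, pop_size: int = 10) -> str:
    pop = [{s: random.choice(v) for s, v in GRAMMAR.items()} for _ in range(pop_size)]
    for _ in range(generations):  # phase 1: global evolutionary search
        pop.sort(key=lambda g: fitness(render(g)), reverse=True)
        pop = pop[: pop_size // 2] + [mutate(g) for g in pop[: pop_size // 2]]
    best = pop[0]
    for slot, options in GRAMMAR.items():  # phase 2: local neighbourhood search
        for option in options:
            candidate = {**best, slot: option}
            if fitness(render(candidate)) > fitness(render(best)):
                best = candidate
    return render(best)
```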
arXiv Detail & Related papers (2025-07-14T14:34:15Z) - Automatic Prompt Optimization for Knowledge Graph Construction: Insights from an Empirical Study [8.482898222291206]
Triple (subject-relation-object) extraction from text is the fundamental building block of KG construction. Recent work on other NLP tasks uses automatic prompt optimization/engineering to address this challenge. We evaluate different settings by changing (a) the prompting strategy, (b) the LLM used for prompt optimization and task execution, (c) the number of canonical relations in the schema, (d) the length and diversity of input text, (e) the metric used to drive the prompt optimization, and (f) the dataset used for training and testing.
arXiv Detail & Related papers (2025-06-24T16:38:49Z) - A Sequential Optimal Learning Approach to Automated Prompt Engineering in Large Language Models [14.483240353801074]
This paper proposes an optimal learning framework for automated prompt engineering. It is designed to sequentially identify effective prompt features while efficiently allocating a limited evaluation budget. Our framework provides a solution for deploying automated prompt engineering in a wider range of applications.
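As a hedged illustration of budgeted sequential evaluation, the sketch below uses successive halving; this is a generic stand-in, not the paper's optimal learning policy, and `score_one` is a hypothetical per-example scorer:

```python
# Budget-aware sequential evaluation sketch: spread a limited number of LLM
# evaluations across prompt candidates and drop weak ones early.
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]

def successive_halving(prompts: List[str], data: List[Example],
                       score_one: Callable[[str, Example], float],
                       budget: int) -> str:
    survivors = list(prompts)
    scores = {p: [] for p in prompts}
    while len(survivors) > 1 and budget > 0:
        per_prompt = max(1, budget // (2 * len(survivors)))
        for p in survivors:
            for ex in random.sample(data, min(per_prompt, len(data))):
                scores[p].append(score_one(p, ex))  # one LLM call per example
                budget -= 1
        # Keep the better-scoring half of the field and reallocate the budget.
        survivors.sort(key=lambda p: sum(scores[p]) / len(scores[p]), reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]
```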
arXiv Detail & Related papers (2025-01-07T03:51:10Z) - Enhancing LLM-Based Text Classification in Political Science: Automatic Prompt Optimization and Dynamic Exemplar Selection for Few-Shot Learning [1.6967824074619953]
Large language models (LLMs) offer substantial promise for text classification in political science. Our framework enhances LLM performance through automatic prompt optimization, dynamic exemplar selection, and a consensus mechanism. An open-source Python package (PoliPrompt) is available on GitHub.
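A minimal sketch of one such consensus mechanism, assuming a hypothetical `classify` LLM wrapper rather than the PoliPrompt API:

```python
# Minimal consensus mechanism of the kind the summary mentions: classify the
# same text with several prompt variants and take the majority label.
from collections import Counter
from typing import Callable, List

def consensus_label(prompts: List[str], text: str,
                    classify: Callable[[str, str], str]) -> str:
    votes = Counter(classify(p, text) for p in prompts)
    return votes.most_common(1)[0][0]  # majority label wins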
arXiv Detail & Related papers (2024-09-02T21:05:31Z) - Task Facet Learning: A Structured Approach to Prompt Optimization [13.423478909210353]
We propose an algorithm that learns multiple facets of a task from a set of training examples. The proposed algorithm, UniPrompt, clusters the input space and uses clustered batches so that each batch likely corresponds to a different facet of the task. Empirical evaluation on multiple datasets and a real-world task shows that prompts generated using UniPrompt obtain higher accuracy than human-tuned prompts and those from state-of-the-art methods.
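A small sketch of the clustering-for-facets idea, using TF-IDF and k-means from scikit-learn as assumed stand-ins for whatever representation the paper actually uses:

```python
# Facet-style batching sketch: cluster training texts so each batch likely
# reflects one facet of the task. Requires scikit-learn.
from collections import defaultdict
from typing import Dict, List

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def clustered_batches(texts: List[str], n_clusters: int = 4) -> Dict[int, List[str]]:
    X = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(X)
    batches: Dict[int, List[str]] = defaultdict(list)
    for text, cluster in zip(texts, labels):
        batches[cluster].append(text)  # one batch per facet-like cluster
    return dict(batches)
```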
arXiv Detail & Related papers (2024-06-15T04:54:26Z) - Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars [66.823588073584]
Large language models (LLMs) have shown impressive capabilities in real-world applications.
The quality of these exemplars in the prompt greatly impacts performance.
Existing methods fail to adequately account for the impact of exemplar ordering on performance.
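One simple ordering-aware remedy is to score candidate orderings empirically, as in the generic sketch below, where `evaluate` is a hypothetical function that builds a few-shot prompt from an ordering and scores it on validation data:

```python
# Ordering-aware exemplar selection sketch: because exemplar order affects
# performance, score candidate orderings rather than treating the set as unordered.
from itertools import islice, permutations
from typing import Callable, List, Tuple

Example = Tuple[str, str]

def best_exemplar_order(exemplars: List[Example],
                        evaluate: Callable[[List[Example]], float],
                        max_orders: int = 24) -> List[Example]:
    # Cap the search: the number of permutations grows factorially.
    candidates = list(islice(permutations(exemplars), max_orders))
    return list(max(candidates, key=lambda order: evaluate(list(order))))
```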
arXiv Detail & Related papers (2024-05-25T08:23:05Z) - Efficient Prompting Methods for Large Language Models: A Survey [50.82812214830023]
Efficient prompting methods have attracted wide attention. We discuss automatic prompt engineering for different prompt components and prompt compression in continuous and discrete spaces.
arXiv Detail & Related papers (2024-04-01T12:19:08Z) - Intent-based Prompt Calibration: Enhancing prompt optimization with synthetic boundary cases [2.6159111710501506]
We introduce a new method for automatic prompt engineering, using a calibration process that iteratively refines the prompt to match the user's intent.
We demonstrate the effectiveness of our method with respect to strong proprietary models on real-world tasks such as moderation and generation.
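A schematic version of such a calibration loop, where all three LLM-backed helpers (`generate_boundary_cases`, `classify`, `refine_prompt`) are hypothetical functions supplied by the caller:

```python
# Illustrative calibration loop in the spirit of the summary: generate
# synthetic boundary cases, collect the ones the current prompt gets wrong,
# and rewrite the prompt toward the user's intent.
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (text, intended label)

def calibrate(prompt: str,
              generate_boundary_cases: Callable[[str], List[Case]],
              classify: Callable[[str, str], str],
              refine_prompt: Callable[[str, List[Case]], str],
              rounds: int = 5) -> str:
    for _ in range(rounds):
        cases = generate_boundary_cases(prompt)            # synthetic edge cases
        failures = [(t, y) for t, y in cases if classify(prompt, t) != y]
        if not failures:
            break                                          # prompt matches intent
        prompt = refine_prompt(prompt, failures)           # rewrite and retry
    return prompt
```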
arXiv Detail & Related papers (2024-02-05T15:28:43Z) - AutoHint: Automatic Prompt Optimization with Hint Generation [11.737818328656735]
This paper presents AutoHint, a novel framework for automatic prompt engineering and optimization for large language models (LLMs).
We propose a framework that inherits the merits of both in-context learning and zero-shot learning by incorporating enriched instructions derived from input-output demonstrations to optimize the original prompt.
We refer to the enrichment as the hint and propose a framework to automatically generate the hint from labeled data.
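In sketch form, the enrichment step might look like the following, with `summarize_rules` as a hypothetical stand-in for the LLM call that AutoHint would use to derive the hint:

```python
# Minimal hint-enrichment sketch: derive a "hint" from labeled input-output
# demonstrations and append it to the original prompt.
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input text, gold label)

def add_hint(prompt: str, demonstrations: List[Demo],
             summarize_rules: Callable[[List[Demo]], str]) -> str:
    hint = summarize_rules(demonstrations)  # e.g. "Label as X when ..."
    return f"{prompt}\n\nHint: {hint}"
```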
arXiv Detail & Related papers (2023-07-13T00:49:27Z) - MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text
Classification [65.51149771074944]
MetricPrompt eases verbalizer design difficulty by reformulating the few-shot text classification task as a text pair relevance estimation task.
We conduct experiments on three widely used text classification datasets across four few-shot settings.
Results show that MetricPrompt outperforms manual verbalizer and other automatic verbalizer design methods across all few-shot settings.
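A minimal sketch of the relevance reformulation, where the pairwise `relevance` scorer is a hypothetical placeholder rather than MetricPrompt's actual prompting model:

```python
# Relevance-reformulation sketch: instead of asking the model for a label
# directly, score each (text, label description) pair and return the best label.
from typing import Callable, Dict

def classify_by_relevance(text: str,
                          label_descriptions: Dict[str, str],
                          relevance: Callable[[str, str], float]) -> str:
    return max(label_descriptions,
               key=lambda label: relevance(text, label_descriptions[label]))
```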
arXiv Detail & Related papers (2023-06-15T06:51:35Z) - TEMPERA: Test-Time Prompting via Reinforcement Learning [57.48657629588436]
We propose Test-time Prompt Editing using Reinforcement Learning (TEMPERA).
In contrast to prior prompt generation methods, TEMPERA can efficiently leverage prior knowledge.
Our method achieves a 5.33x average improvement in sample efficiency compared to traditional fine-tuning methods.
arXiv Detail & Related papers (2022-11-21T22:38:20Z) - LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z) - RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning [84.75064077323098]
This paper proposes RLPrompt, an efficient discrete prompt optimization approach based on reinforcement learning (RL).
RLPrompt is flexibly applicable to different types of LMs, such as masked models (e.g., BERT) and left-to-right models (e.g., GPTs).
Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods.
arXiv Detail & Related papers (2022-05-25T07:50:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.