DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2512.02246v1
- Date: Mon, 01 Dec 2025 22:28:39 GMT
- Title: DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models
- Authors: Olivia Kim
- Abstract summary: This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity - how detailed or vague a prompt is - remains understudied. This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks. Our results highlight the need for adaptive prompting strategies and provide tools and data to support further research.
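The abstract's measurement pipeline has two concrete, reproducible steps: scoring a prompt's specificity via perplexity and grading answers with a GPT-based semantic-equivalence judge. Below is a minimal sketch of both steps, assuming a GPT-2 scorer and the OpenAI chat API with a gpt-4o-mini judge; the paper names only "perplexity" and "GPT-based semantic equivalence", so the model choices and judge prompt wording here are illustrative, not the authors' exact setup.

```python
# Sketch of the two DETAIL-style measurements (assumed implementations):
# 1) specificity proxied by prompt perplexity, scored here with GPT-2;
# 2) correctness judged by a GPT model for semantic equivalence.
# Model choices (gpt2, gpt-4o-mini) are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from openai import OpenAI

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
scorer = GPT2LMHeadModel.from_pretrained("gpt2").eval()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def prompt_perplexity(prompt: str) -> float:
    """Perplexity of the prompt under GPT-2, used as a proxy signal for specificity."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = scorer(enc.input_ids, labels=enc.input_ids)
    return float(torch.exp(out.loss))

def semantically_equivalent(reference: str, answer: str) -> bool:
    """Ask a GPT judge whether a model answer matches the reference answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": ("Do these two answers mean the same thing? Reply YES or NO.\n"
                        f"Reference: {reference}\nAnswer: {answer}"),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# Usage: score prompts of increasing specificity, then grade a model's answer.
vague = "Explain how to sort a list."
detailed = "Explain, step by step, how merge sort splits and merges a list of integers."
print(prompt_perplexity(vague), prompt_perplexity(detailed))
print(semantically_equivalent("O(n log n)", "n log n time complexity"))
```

In a setup like this, each reasoning task would be run at several specificity levels, with per-level accuracy compared against the corresponding perplexity scores.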
Related papers
- The Impact of Role Design in In-Context Learning for Large Language Models [1.3177681589844814]
In-context learning (ICL) enables Large Language Models (LLMs) to generate predictions based on prompts without additional fine-tuning. This study examines the influence of role configurations in zero-shot and few-shot learning scenarios using GPT-3.5 and GPT-4o from OpenAI and Llama2-7b and Llama2-13b from Meta.
arXiv Detail & Related papers (2025-09-27T21:15:30Z) - PromptPrism: A Linguistically-Inspired Taxonomy for Prompts [13.169345040931857]
We introduce PromptPrism, a linguistically-inspired taxonomy that enables prompt analysis across three hierarchical levels. We show the practical utility of PromptPrism by applying it to three applications.
arXiv Detail & Related papers (2025-05-19T01:08:26Z) - Prompt Engineering: How Prompt Vocabulary affects Domain Knowledge [0.0]
This thesis addresses the problem of whether increasing the specificity of vocabulary in prompts improves domain-specific question-answering and reasoning tasks. We developed a synonymization framework to systematically substitute nouns, verbs, and adjectives with varying specificity levels, measuring the impact on four large language models (LLMs); a minimal sketch of such a substitution step appears after this list. Our results reveal that while increasing prompt specificity generally does not have a significant effect, there appears to be a specificity range, shared across all considered models, in which the LLM performs best.
arXiv Detail & Related papers (2025-05-10T08:40:04Z) - Benchmarking Prompt Sensitivity in Large Language Models [13.986971540998258]
Large language models (LLMs) are highly sensitive to variations in prompt formulation. This paper introduces a new task, Prompt Sensitivity Prediction, and a dataset designed to investigate the effects of slight prompt variations on LLM performance.
arXiv Detail & Related papers (2025-02-09T23:01:03Z) - MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization [73.7779735046424]
We show that different prompts should be adapted to different Large Language Models (LLM) to enhance their capabilities across various downstream tasks in NLP.
We then propose a Model-Adaptive Prompt Optimization (MAPO) method that optimizes the original prompts for each specific LLM in downstream tasks.
arXiv Detail & Related papers (2024-07-04T18:39:59Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting [68.19544657508509]
Large language models (LLMs) are adopted as a fundamental component of language technologies.
We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings.
We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights.
arXiv Detail & Related papers (2023-10-17T15:03:30Z) - MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text Classification [65.51149771074944]
MetricPrompt eases verbalizer design difficulty by reformulating few-shot text classification task into text pair relevance estimation task.
We conduct experiments on three widely used text classification datasets across four few-shot settings.
Results show that MetricPrompt outperforms manual verbalizer and other automatic verbalizer design methods across all few-shot settings.
arXiv Detail & Related papers (2023-06-15T06:51:35Z) - Empirical Evaluation of ChatGPT on Requirements Information Retrieval Under Zero-Shot Setting [12.733403458944972]
We empirically evaluate ChatGPT's performance on requirements information retrieval tasks.
Under the zero-shot setting, evaluation results reveal ChatGPT's promising ability to retrieve requirements-relevant information.
arXiv Detail & Related papers (2023-04-25T04:09:45Z) - Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z) - Reframing Instructional Prompts to GPTk's Language [72.69833640335519]
We propose reframing techniques for model designers to create effective prompts for language models.
Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity.
The performance gains are particularly important for large language models such as GPT-3, where tuning models or prompts on large datasets is not feasible.
arXiv Detail & Related papers (2021-09-16T09:44:43Z) - CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented Dialog Systems [56.302581679816775]
This paper proposes Comprehensive Instruction (CINS), which exploits pre-trained language models (PLMs) with task-specific instructions.
We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in task-oriented dialog (ToD).
Experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data.
arXiv Detail & Related papers (2021-09-10T03:23:06Z)
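One entry above, "Prompt Engineering: How Prompt Vocabulary affects Domain Knowledge", describes substituting nouns, verbs, and adjectives at varying specificity levels. The sketch below shows one way such a substitution step could look, using WordNet hypernyms and hyponyms as the specificity axis; both the library and the heuristic are assumptions for illustration, not that thesis's actual implementation.

```python
# Sketch of a vocabulary-substitution step that swaps a word for a more
# general (hypernym) or more specific (hyponym) term. WordNet and this
# heuristic are illustrative assumptions, not the cited thesis's method.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def shift_specificity(word: str, pos: str, direction: str) -> str:
    """Return a more general or more specific WordNet term for `word`,
    falling back to the original word when no related term exists."""
    synsets = wn.synsets(word, pos=pos)  # pos is wn.NOUN, wn.VERB, or wn.ADJ
    if not synsets:
        return word
    related = (synsets[0].hypernyms() if direction == "general"
               else synsets[0].hyponyms())
    if not related:
        return word
    return related[0].lemma_names()[0].replace("_", " ")

# e.g. shift_specificity("dog", wn.NOUN, "general") -> a broader term such as "canine"
```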