Related papers: Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks

Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks

URL: http://arxiv.org/abs/2407.03624v2
Date: Mon, 26 Aug 2024 08:09:39 GMT
Title: Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks
Authors: Dharunish Yugeswardeenoo, Kevin Zhu, Sean O'Brien,
Abstract summary: We propose a novel prompting strategy called Question Analysis Prompting (QAP) QAP is evaluated on GPT 3.5 Turbo and GPT 4 Turbo on arithmetic datasets GSM8K, AQuA, and SAT and commonsense dataset StrategyQA. QAP consistently ranks among the top-2 prompts on 75% of the tests.
Score: 3.741953084205603
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although LLMs have the potential to transform many fields, they still underperform humans in reasoning tasks. Existing methods induce the model to produce step-by-step calculations, but this research explores the question: Does making the LLM analyze the question improve its performance? We propose a novel prompting strategy called Question Analysis Prompting (QAP), in which the model is prompted to explain the question in $n$ words before solving. The value of $n$ influences the length of response generated by the model. QAP is evaluated on GPT 3.5 Turbo and GPT 4 Turbo on arithmetic datasets GSM8K, AQuA, and SAT and commonsense dataset StrategyQA. QAP is compared with other state-of-the-art prompts including Chain-of-Thought (CoT), Plan and Solve Prompting (PS+) and Take A Deep Breath (TADB). QAP outperforms all state-of-the-art prompts on AQuA and SAT datasets on both GPT3.5 and GPT4. QAP consistently ranks among the top-2 prompts on 75\% of the tests. A key factor of QAP performance can be attributed to response length, where detailed responses are beneficial when answering harder questions, but can negatively affect easy questions.

Related papers

Prompting Strategies for Enabling Large Language Models to Infer Causation from Correlation [68.58373854950294]
We focus on causal reasoning and address the task of establishing causal relationships based on correlation information. We introduce a prompting strategy for this problem that breaks the original task into fixed subquestions. We evaluate our approach on an existing causal benchmark, Corr2Cause.
arXiv Detail & Related papers (2024-12-18T15:32:27Z)
Question: How do Large Language Models perform on the Question Answering tasks? Answer: [0.0]
Large Language Models (LLMs) have been showing promising results for various NLP-tasks without the explicit need to be trained for these tasks by using few-shot or zero-shot prompting techniques. We propose a comprehensive performance comparison between smaller fine-tuned models and out-of-the-box instruction-following LLMs on the Stanford Question Answering dataset 2.0 (SQuAD2) Our results show that smaller, fine-tuned models outperform current State-Of-The-Art (SOTA) LLMs on the fine-tuned task, but recent SOTA models are able to close this gap on the out
arXiv Detail & Related papers (2024-12-17T13:19:38Z)
AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses [26.850344968677582]
We propose a method that leverages large language models to evaluate answers to open-ended questions. We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4. Our results indicate that our approach more closely aligns with human judgment compared to the four baselines.
arXiv Detail & Related papers (2024-10-02T05:22:07Z)
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence [10.367359022491181]
We provide rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient, and interpretable QA evaluation that is more stable than an exact match and neural methods(BERTScore)
arXiv Detail & Related papers (2024-02-17T01:56:19Z)
Can We Verify Step by Step for Incorrect Answer Detection? [22.984011562264147]
We introduce a benchmark, R2PE, designed to explore the relationship between reasoning chains and performance in various reasoning tasks. This benchmark aims to measure the falsehood of the final output of LLMs based on the reasoning steps. We propose the process discernibility score (PDS) framework that beats the answer-checking baseline by a large margin.
arXiv Detail & Related papers (2024-02-16T09:29:50Z)
DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs) We first categorize questions into two subsets based on confidence score ($mathcalCS$), which is estimated by statistical frequency of generated answers. In particular, we first categorize questions into two subsets based on confidence score ($mathcalCS$), which is estimated by statistical frequency of generated answers.
arXiv Detail & Related papers (2024-01-10T14:38:46Z)
Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts [22.669502403623166]
We present Reasoning Question Prompts for VQA tasks, which can further activate the potential of Large Language Models. We generate self-contained questions as reasoning question prompts via an unsupervised question edition module. Each reasoning question prompt clearly indicates the intent of the original question. Then, the candidate answers associated with their confidence scores acting as answer integritys are fed into LLMs.
arXiv Detail & Related papers (2023-11-15T15:40:46Z)
MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering [64.6741991162092]
We present MinPrompt, a minimal data augmentation framework for open-domain question answering. We transform the raw text into a graph structure to build connections between different factual sentences. We then apply graph algorithms to identify the minimal set of sentences needed to cover the most information in the raw text. We generate QA pairs based on the identified sentence subset and train the model on the selected sentences to obtain the final model.
arXiv Detail & Related papers (2023-10-08T04:44:36Z)
Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge [155.81786738036578]
Open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs. Pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources. We propose RASO: a new VQA pipeline that deploys a generate-then-select strategy guided by world knowledge.
arXiv Detail & Related papers (2023-05-30T08:34:13Z)
An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context. In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning. Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA) We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging. Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
PACIFIC: Towards Proactive Conversational Question Answering over Tabular and Textual Data in Finance [96.06505049126345]
We present a new dataset, named PACIFIC. Compared with existing CQA datasets, PACIFIC exhibits three key features: (i) proactivity, (ii) numerical reasoning, and (iii) hybrid context of tables and text. A new task is defined accordingly to study Proactive Conversational Question Answering (PCQA), which combines clarification question generation and CQA. UniPCQA performs multi-task learning over all sub-tasks in PCQA and incorporates a simple ensemble strategy to alleviate the error propagation issue in the multi-task learning by cross-validating top-$k$ sampled Seq2Seq
arXiv Detail & Related papers (2022-10-17T08:06:56Z)
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies [78.68534915690404]
StrategyQA is a benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Overall, StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs.
arXiv Detail & Related papers (2021-01-06T19:14:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.