Asking Again and Again: Exploring LLM Robustness to Repeated Questions
- URL: http://arxiv.org/abs/2412.07923v3
- Date: Wed, 12 Mar 2025 13:48:12 GMT
- Title: Asking Again and Again: Exploring LLM Robustness to Repeated Questions
- Authors: Sagi Shaier, Mario Sanz-Guerrero, Katharina von der Wense,
- Abstract summary: We evaluate five recent large language models (LLMs) on reading comprehension datasets. Our results demonstrate that question repetition can increase models' accuracy by up to $6\%$. Across all models, settings, and datasets, we do not find the result statistically significant.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study investigates whether repeating questions within prompts influences the performance of large language models (LLMs). We hypothesize that reiterating a question within a single prompt might enhance the model's focus on key elements of the query. We evaluate five recent LLMs -- including GPT-4o-mini, DeepSeek-V3, and smaller open-source models -- on three reading comprehension datasets under different prompt settings, varying question repetition levels (1, 3, or 5 times per prompt). Our results demonstrate that question repetition can increase models' accuracy by up to $6\%$. However, across all models, settings, and datasets, we do not find the result statistically significant. These findings provide insights into prompt design and LLM behavior, suggesting that repetition alone does not significantly impact output quality.
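For concreteness, below is a minimal sketch of the repetition setting studied here: the passage appears once and the question is repeated 1, 3, or 5 times within a single prompt. The template wording is an illustrative assumption, not the paper's exact prompt.

```python
def build_prompt(context: str, question: str, repeats: int = 3) -> str:
    """Place the passage once and the question `repeats` times in one prompt."""
    repeated = "\n".join([f"Question: {question}"] * repeats)
    return f"Context: {context}\n\n{repeated}\n\nAnswer:"

# Example: the 5-repetition condition from the paper's {1, 3, 5} settings.
print(build_prompt("The Nile flows through northeastern Africa.",
                   "Which continent does the Nile flow through?",
                   repeats=5))
```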
Related papers
- Memory Is All You Need: Testing How Model Memory Affects LLM Performance in Annotation Tasks [0.0]
We show that allowing the model to retain information about its own previous classifications yields significant performance improvements.
We propose a novel approach that combines model memory and reinforcement learning.
arXiv Detail & Related papers (2025-03-06T16:39:18Z)
- DRS: Deep Question Reformulation With Structured Output [114.14122339938697]
Large language models (LLMs) can detect unanswerable questions, but struggle to assist users in reformulating these questions.
We propose DRS: Deep Question Reformulation with Structured Output, a novel zero-shot method aimed at enhancing LLMs' ability to assist users in reformulating questions.
We show that DRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while also enhancing the performance of open-source models, such as Gemma2-9B, from 26.35% to 56.75%.
arXiv Detail & Related papers (2024-11-27T02:20:44Z)
- Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models [8.846200844870767]
We discover an understudied type of undesirable behavior of Large Language Models (LLMs), which we term Verbosity Compensation (VC), analogous to the hesitation behavior of humans under uncertainty.
We propose a simple yet effective cascade algorithm that replaces verbose responses with other model-generated responses.
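A hedged sketch of a cascade in this spirit: if a response exceeds a conciseness budget, fall back to another model's response. The word-count threshold and the stub models are illustrative assumptions, not the paper's algorithm verbatim.

```python
def cascade(prompt, models, max_words=10):
    """Try models in order; return the first sufficiently concise response."""
    response = ""
    for model in models:
        response = model(prompt)
        if len(response.split()) <= max_words:  # concise enough; stop here
            return response
    return response  # every model was verbose; keep the last response

small = lambda p: ("Well, it could perhaps be argued that, depending on "
                   "interpretation, the answer might possibly be Paris.")
large = lambda p: "Paris."
print(cascade("What is the capital of France?", [small, large]))
```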
arXiv Detail & Related papers (2024-11-12T15:15:20Z)
- RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions [52.33835101586687]
Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries.
This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
- AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses [26.850344968677582]
We propose a method that leverages large language models to evaluate answers to open-ended questions.
We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4.
Our results indicate that our approach more closely aligns with human judgment compared to the four baselines.
arXiv Detail & Related papers (2024-10-02T05:22:07Z)
- See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses [51.975495361024606]
We propose a Self-Challenge evaluation framework with human-in-the-loop.
Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances.
We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
arXiv Detail & Related papers (2024-08-16T19:01:52Z)
- Comparison of Large Language Models for Generating Contextually Relevant Questions [6.080820450677854]
GPT-3.5, Llama 2-Chat 13B, and T5 XXL are compared in their ability to create questions from university slide text without fine-tuning.
Results indicate that GPT-3.5 and Llama 2-Chat 13B outperform T5 XXL by a small margin, particularly in terms of clarity and question-answer alignment.
arXiv Detail & Related papers (2024-07-30T06:23:59Z)
- Retrieve, Summarize, Plan: Advancing Multi-hop Question Answering with an Iterative Approach [6.549143816134531]
We propose a novel iterative RAG method called ReSP, equipped with a dual-function summarizer.
Experimental results on the multi-hop question-answering benchmarks HotpotQA and 2WikiMultihopQA demonstrate that our method significantly outperforms the state-of-the-art.
arXiv Detail & Related papers (2024-07-18T02:19:00Z)
- CLARINET: Augmenting Language Models to Ask Clarification Questions for Retrieval [52.134133938779776]
We present CLARINET, a system that asks informative clarification questions by choosing questions whose answers would maximize certainty in the correct candidate.
Our approach works by augmenting a large language model (LLM) to condition on a retrieval distribution, finetuning end-to-end to generate the question that would have maximized the rank of the true candidate at each turn.
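The selection criterion above can be illustrated with a toy calculation (all numbers invented): among candidate clarification questions, prefer the one whose expected answer leaves the least uncertainty over the retrieval candidates. This is only an intuition for "maximize certainty in the correct candidate", not CLARINET's actual end-to-end finetuning objective.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a distribution over candidates."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def expected_posterior_entropy(answer_probs, posteriors):
    """Expected remaining uncertainty after hearing the user's answer."""
    return sum(pa * entropy(post) for pa, post in zip(answer_probs, posteriors))

prior = [0.4, 0.3, 0.3]  # retrieval distribution over three candidates
# Question A: its possible answers barely separate the candidates.
h_a = expected_posterior_entropy([0.9, 0.1],
                                 [[0.45, 0.30, 0.25], [0.10, 0.50, 0.40]])
# Question B: its possible answers split the candidate set sharply.
h_b = expected_posterior_entropy([0.5, 0.5],
                                 [[0.80, 0.10, 0.10], [0.00, 0.60, 0.40]])
print(f"H(prior)={entropy(prior):.2f}  E[H|A]={h_a:.2f}  E[H|B]={h_b:.2f}")
# B has the lower expected posterior entropy, so it would be asked first.
```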
arXiv Detail & Related papers (2024-04-28T18:21:31Z)
- SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs [85.54906813106683]
We propose SuRe, a simple yet effective framework to enhance open-domain question answering (ODQA) with large language models (LLMs).
SuRe helps LLMs predict more accurate answers for a given question, answers that are well supported by the summarized retrievals.
Experimental results on diverse ODQA benchmarks demonstrate the superiority of SuRe, with improvements of up to 4.6% in exact match (EM) and 4.0% in F1 score over standard prompting approaches.
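A rough sketch of the summarize-then-select flow named above: draft answer candidates, summarize the retrieved passages as support for each candidate, and keep the candidate whose summary the model scores highest. The `llm` callable and all prompt wordings are placeholders, not the paper's templates.

```python
def sure_answer(llm, question, passages, n_candidates=2):
    docs = "\n".join(passages)
    # 1) Draft distinct answer candidates from the retrieved passages.
    candidates = [llm(f"{docs}\nQ: {question}\nAnswer candidate #{i + 1}:")
                  for i in range(n_candidates)]
    # 2) Conditionally summarize the retrievals in support of each candidate.
    summaries = {c: llm(f"{docs}\nSummarize the evidence that the answer to "
                        f"'{question}' is '{c}'.") for c in candidates}
    # 3) Keep the candidate whose summary is judged best supported
    #    (assumes the scoring prompt returns a parseable number).
    scores = {c: float(llm(f"On a 0-10 scale, rate how well this evidence "
                           f"supports '{c}':\n{s}"))
              for c, s in summaries.items()}
    return max(scores, key=scores.get)
```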
arXiv Detail & Related papers (2024-04-17T01:15:54Z)
- Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately [2.1715455600756646]
Large Language Models (LLMs) generate responses to questions, but their effectiveness is often hindered by the sub-optimal quality of answers and by occasional failures to provide accurate responses.
To address these challenges, a fine-tuning process is employed, involving feedback and examples to refine models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z)
- Evaluating ChatGPT as a Question Answering System: A Comprehensive Analysis and Comparison with Existing Models [0.0]
This article scrutinizes ChatGPT as a Question Answering System (QAS).
The primary focus is on evaluating ChatGPT's proficiency in extracting responses from provided paragraphs.
The evaluation highlights hallucinations, where ChatGPT provides responses to questions whose answers are not available in the provided context.
arXiv Detail & Related papers (2023-12-11T08:49:18Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and increases of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
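As a small illustration of search-engine augmentation in this spirit, date-stamped search snippets can be placed ahead of the question so the model answers from current evidence; the `fresh_prompt` helper and its template are assumptions, not FreshPrompt's exact format.

```python
def fresh_prompt(question, snippets):
    """snippets: (date, text) pairs, ordered oldest to newest."""
    evidence = "\n".join(f"[{date}] {text}" for date, text in snippets)
    return (f"Evidence from search:\n{evidence}\n\n"
            f"Question: {question}\n"
            f"Answer, preferring the most recent evidence:")

print(fresh_prompt("Who is the UK prime minister?",
                   [("2022-09-06", "Liz Truss becomes prime minister."),
                    ("2022-10-25", "Rishi Sunak becomes prime minister.")]))
```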
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- Enhancing In-Context Learning with Answer Feedback for Multi-Span Question Answering [9.158919909909146]
In this paper, we propose a novel way of employing labeled data such that it informs the LLM of some undesired output.
Experiments on three multi-span question answering datasets and a keyphrase extraction dataset show that our new prompting strategy consistently improves LLM's in-context learning performance.
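One way to read "informs the LLM of some undesired output" is an in-context example that pairs a question with both an undesired (e.g. incomplete) answer and the gold spans; the template below is an illustrative assumption, not the paper's prompt.

```python
def feedback_example(question, gold, bad):
    """Render one in-context demonstration with negative answer feedback."""
    return (f"Question: {question}\n"
            f"Undesired answer (incomplete): {', '.join(bad)}\n"
            f"Correct answer (all spans): {', '.join(gold)}")

demo = feedback_example("Which rivers flow through Germany?",
                        gold=["Rhine", "Elbe", "Danube"],
                        bad=["Rhine"])  # a plausible under-extraction
print(demo + "\n\nQuestion: Which rivers flow through Egypt?"
             "\nCorrect answer (all spans):")
```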
arXiv Detail & Related papers (2023-06-07T15:20:24Z)
- Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions [50.114651561111245]
We propose IRCoT, a new approach for multi-step question answering.
It interleaves retrieval with chain-of-thought (CoT) steps, using the CoT to guide retrieval and, in turn, the retrieved results to improve the CoT.
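A minimal sketch of that interleaving loop: each generated CoT sentence becomes the next retrieval query, and newly retrieved passages feed the next CoT step. `llm` and `retrieve` are hypothetical stubs, and the stopping convention is an assumption.

```python
def ircot(llm, retrieve, question, max_steps=4):
    passages, thoughts = retrieve(question), []
    for _ in range(max_steps):
        context = "\n".join(passages)
        step = llm(f"{context}\nQ: {question}\n"
                   f"Thoughts so far: {' '.join(thoughts)}\n"
                   f"Next reasoning step (or 'ANSWER: ...'):")
        thoughts.append(step)
        if step.startswith("ANSWER:"):   # model decided it is done
            return step.removeprefix("ANSWER:").strip()
        passages += retrieve(step)       # retrieval guided by the new CoT step
    return thoughts[-1]                  # step budget exhausted; best effort
```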
arXiv Detail & Related papers (2022-12-20T18:26:34Z)
- Recitation-Augmented Language Models [85.30591349383849]
We show that RECITE is a powerful paradigm for knowledge-intensive NLP tasks.
Specifically, we show that by utilizing recitation as the intermediate step, a recite-and-answer scheme can achieve new state-of-the-art performance.
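Read literally, the recite-and-answer scheme is two chained model calls: the model first recites a relevant passage from its own parametric memory, then answers conditioned on that recitation. The prompt wording below is an assumption for illustration.

```python
def recite_and_answer(llm, question):
    """First elicit a recitation from memory, then answer conditioned on it."""
    recitation = llm(f"Recite a passage you know that is relevant to: {question}")
    return llm(f"Passage: {recitation}\nQuestion: {question}\nAnswer:")
```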
arXiv Detail & Related papers (2022-10-04T00:49:20Z)
- Open Question Answering over Tables and Text [55.8412170633547]
In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question.
Most open QA systems have considered only retrieving information from unstructured text.
We present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task.
arXiv Detail & Related papers (2020-10-20T16:48:14Z)
- A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering [15.355557454305776]
We show that question rewriting (QR) of the conversational context allows us to shed more light on this phenomenon.
We present the results of this analysis on the TREC CAsT and QuAC (CANARD) datasets.
arXiv Detail & Related papers (2020-10-13T06:29:51Z)