Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models
- URL: http://arxiv.org/abs/2407.16221v1
- Date: Tue, 23 Jul 2024 06:56:54 GMT
- Title: Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models
- Authors: Nishanth Madhusudhan, Sathwik Tejaswi Madhusudhan, Vikas Yadav, Masoud Hashemi,
- Abstract summary: Abstention Ability (AA) is the ability of Large Language Models (LLMs) to refrain from answering questions when they are uncertain or when definitive answer is not possible.
We propose a black-box evaluation methodology to examine and understand the AA of LLMs across a variety of multiple-choice QA tasks.
Our findings reveal that while even state-of-the-art LLMs like GPT-4 struggle with abstention, strategic prompting can significantly enhance this ability.
- Score: 4.377568983107492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) achieve remarkable performance across various NLP tasks, their reliability becomes essential for widespread adoption. This paper focuses on Abstention Ability (AA), a critical yet under explored aspect of reliability - the ability of LLMs to refrain from answering questions when they are uncertain or when definitive answer is not possible, while maintaining question-answering (QA) task performance. While previous works have focused on understanding the recollection abilities of LLMs or their ability to identify imponderable/unanswerable questions, we believe there is a need for an effective AA evaluation method. Therefore, we propose a black-box evaluation methodology to examine and understand the AA of LLMs across a variety of multiple-choice QA tasks. We measure AA by rewarding models for abstaining from answering when their predictions are incorrect or when the questions are inherently unanswerable. We investigate three strategies, Strict Prompting, Verbal Confidence Thresholding, and Chain-of-Thought (CoT), to understand their impact on abstention across different LLMs. Our findings reveal that while even state-of-the-art LLMs like GPT-4 struggle with abstention, strategic prompting such as CoT, can significantly enhance this ability. Furthermore, we demonstrate that improving AA also leads to better overall QA task performance, underscoring the importance of evaluating AA in LLMs.
Related papers
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions [32.871820908561936]
AbstentionBench is a benchmark for holistically evaluating abstention across 20 diverse datasets.<n>We find that reasoning fine-tuning degrades abstention even for math and science domains.
arXiv Detail & Related papers (2025-06-10T17:57:30Z) - Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models [11.379764847748378]
Large language models (LLMs) often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs.<n>This emphasizes the significance of possessing the textbfPremise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises.<n>We introduce the textbfPremise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics.
arXiv Detail & Related papers (2025-05-29T17:49:44Z) - Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization [80.09112808413133]
Mujica is a planner that decomposes questions into acyclic graph of subquestions and a worker that resolves questions via retrieval and reasoning.<n>MyGO is a novel reinforcement learning method that replaces traditional policy updates with gradient Likelihood Maximum Estimation.<n> Empirical results across multiple datasets demonstrate the effectiveness of MujicaMyGO in enhancing multi-hop QA performance.
arXiv Detail & Related papers (2025-05-20T18:33:03Z) - Towards Lighter and Robust Evaluation for Retrieval Augmented Generation [1.631189594086952]
We propose a study which demonstrates the interest of open-weight models for evaluating RAG hallucination.
We develop a lightweight approach using smaller, quantized LLMs to provide an accessible and interpretable metric.
This score allows us to question decisions' reliability and explore thresholds to develop a new AUC metric.
arXiv Detail & Related papers (2025-03-20T13:58:32Z) - LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise.
We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z) - Unc-TTP: A Method for Classifying LLM Uncertainty to Improve In-Context Example Selection [6.813733517894384]
Large Language Models (LLMs) have demonstrated exceptional performance across various downstream tasks.
It is challenging for users to discern whether the responses are generated with certainty or are fabricated to meet user expectations.
We propose a novel Uncertainty Tripartite Testing Paradigm (Unc-TTP) to classify LLM uncertainty.
arXiv Detail & Related papers (2024-08-17T11:33:23Z) - AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.
It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection [90.71323430635593]
We propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers.
Building upon this paradigm, we introduce a two-step framework, which firstly instructs LLM to reflect and provide justifications for each candidate answer.
This framework can be seamlessly integrated with existing approaches for superior self-detection.
arXiv Detail & Related papers (2024-03-15T02:38:26Z) - CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering [14.366087533102656]
Question answering (QA) can only make progress if we know if an answer is correct.
Current evaluation metrics to determine answer equivalence (AE) often do not align with human judgments.
arXiv Detail & Related papers (2024-01-24T01:30:25Z) - Improving Open Information Extraction with Large Language Models: A
Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z) - Uncertainty-Driven Action Quality Assessment [67.20617610820857]
We propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture the diversity among multiple judge scores.
We generate the estimation of uncertainty for each prediction, which is employed to re-weight AQA regression loss.
Our proposed method achieves competitive results on three benchmarks including the Olympic events MTL-AQA and FineDiving, and the surgical skill JIGSAWS datasets.
arXiv Detail & Related papers (2022-07-29T07:21:15Z) - A new interpretable unsupervised anomaly detection method based on
residual explanation [47.187609203210705]
We present RXP, a new interpretability method to deal with the limitations for AE-based AD in large-scale systems.
It stands out for its implementation simplicity, low computational cost and deterministic behavior.
In an experiment using data from a real heavy-haul railway line, the proposed method achieved superior performance compared to SHAP.
arXiv Detail & Related papers (2021-03-14T15:35:45Z) - Active Feature Acquisition with Generative Surrogate Models [11.655069211977464]
In this work, we consider models that perform active feature acquisition (AFA) and query the environment for unobserved features.
Our work reformulates the Markov decision process (MDP) that underlies the AFA problem as a generative modeling task.
We propose learning a generative surrogate model ( GSM) that captures the dependencies among input features to assess potential information gain from acquisitions.
arXiv Detail & Related papers (2020-10-06T02:10:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.