Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
- URL: http://arxiv.org/abs/2410.15512v2
- Date: Wed, 12 Feb 2025 02:06:35 GMT
- Title: Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
- Authors: Nishant Balepur, Feng Gu, Abhilasha Ravichander, Shi Feng, Jordan Boyd-Graber, Rachel Rudinger
- Abstract summary: We test reverse question answering (RQA): for an input answer, give a question with that answer.
By finding question and answer types that lead to RQA errors, we suggest improvements for LLM reasoning.
- Score: 24.614521528699093
- Abstract: Question answering (QA), giving correct answers to questions, is a popular task, but we test reverse question answering (RQA): for an input answer, give a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and checking reasoning consistency. We run 16 LLMs on QA and RQA with trivia questions/answers, revealing: 1) Versus QA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not from knowledge gaps alone; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to provide valid multi-hop questions. By finding question and answer types that lead to RQA errors, we suggest improvements for LLM reasoning.
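As a rough illustration of the round-trip protocol the abstract describes, here is a minimal sketch: generate a question for a given answer (RQA), answer that question (QA), and check self-consistency. The prompts and the `ask_llm` helper are assumptions for illustration, not the paper's code.

```python
# Minimal sketch of the QA/RQA round-trip consistency check described above.
# `ask_llm` is a hypothetical helper wrapping any chat/completion API.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire up an LLM client here")

def rqa_then_qa(answer: str) -> tuple[str, str, bool]:
    """RQA: generate a question for `answer`; QA: answer it; check consistency."""
    question = ask_llm(
        f"Write a trivia question whose answer is exactly: {answer}\nQuestion:"
    )
    round_trip = ask_llm(f"Answer this trivia question concisely: {question}\nAnswer:")
    consistent = round_trip.strip().lower() == answer.strip().lower()
    return question, round_trip, consistent
```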
Related papers
- Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above [14.5781090243416]
Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing.
We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge.
arXiv Detail & Related papers (2025-02-19T22:11:52Z)
- Are LLMs Aware that Some Questions are not Open-ended? [58.93124686141781]
We study whether Large Language Models are aware that some questions have limited answers and need to respond more deterministically.
The lack of question awareness in LLMs leads to two phenomena: (1) answering non-open-ended questions too casually, and (2) answering open-ended questions in an overly dull, deterministic way.
arXiv Detail & Related papers (2024-10-01T06:07:00Z)
- RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering [61.19126689470398]
Long-form RobustQA (LFRQA) is a new dataset covering 26K queries and large corpora across seven different domains.
We show via experiments that RAG-QA Arena and human judgments on answer quality are highly correlated.
Only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
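For illustration, a minimal sketch of the pairwise preference-rate computation behind figures like the 41.3% above; the judge callable is a hypothetical stand-in for the paper's LLM or human judging setup.

```python
# Sketch: pairwise preference rate, as in "41.3% of answers preferred".
# `judge_prefers_model` is a hypothetical stand-in for an LLM or human judge.

from typing import Callable, Iterable, Tuple

def preference_rate(
    pairs: Iterable[Tuple[str, str]],  # (model_answer, reference_answer)
    judge_prefers_model: Callable[[str, str], bool],
) -> float:
    pairs = list(pairs)
    wins = sum(judge_prefers_model(model, ref) for model, ref in pairs)
    return wins / len(pairs)
```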
arXiv Detail & Related papers (2024-07-19T03:02:51Z)
- Can multiple-choice questions really be useful in detecting the abilities of LLMs? [15.756543037102256]
Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs).
The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQ's efficacy.
We evaluate nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English.
arXiv Detail & Related papers (2024-03-26T14:43:48Z)
- PEDANTS: Cheap but Effective and Interpretable Answer Equivalence [10.367359022491181]
We provide rubrics and datasets for evaluating machine QA adopted from the Trivia community.
We also propose an efficient and interpretable QA evaluation method that is more stable than exact match and neural methods (BERTScore).
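For context, here is a sketch of the lexical baselines (normalized exact match and token F1) that such methods are typically compared against; this is an assumed illustration of the contrast the entry draws, not the PEDANTS rubric itself.

```python
# Lexical answer-equivalence baselines (normalized exact match and token F1).
# This illustrates the exact-match vs. softer-matching contrast; it is NOT
# the PEDANTS rubric.

import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```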
arXiv Detail & Related papers (2024-02-17T01:56:19Z)
- Answering Ambiguous Questions with a Database of Questions, Answers, and Revisions [95.92276099234344]
We present a new state-of-the-art for answering ambiguous questions that exploits a database of unambiguous questions generated from Wikipedia.
Our method improves performance by 15% on recall measures and 10% on measures which evaluate disambiguating questions from predicted outputs.
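For illustration, a minimal sketch of retrieving disambiguations from such a question database, assuming a generic sentence-embedding model; the details are not from the paper.

```python
# Sketch: retrieve nearby unambiguous questions (with answers) for an
# ambiguous query. The embedder is a hypothetical placeholder.

import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in any sentence-embedding model")

def retrieve_interpretations(
    query: str,
    db: list[tuple[str, str]],  # (unambiguous_question, answer) pairs
    k: int = 5,
) -> list[tuple[str, str]]:
    q = embed(query)
    # For a real system, precompute and index the question embeddings.
    scored = sorted(db, key=lambda qa: float(embed(qa[0]) @ q), reverse=True)
    return scored[:k]
```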
arXiv Detail & Related papers (2023-08-16T20:23:16Z)
- Won't Get Fooled Again: Answering Questions with False Premises [79.8761549830075]
Pre-trained language models (PLMs) have shown unprecedented potential in various fields.
PLMs tend to be easily deceived by tricky questions such as "How many eyes does the sun have?"
We find that the PLMs already possess the knowledge required to rebut such questions.
arXiv Detail & Related papers (2023-07-05T16:09:21Z)
- RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging for them, providing a quantifiable test for building more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
- Can Question Rewriting Help Conversational Question Answering? [13.484873786389471]
Question rewriting (QR) is a subtask of conversational question answering (CQA).
We investigate a reinforcement learning approach that integrates QR and CQA tasks and does not require corresponding QR datasets for targeted CQA.
We find, however, that the RL method is on par with the end-to-end baseline.
arXiv Detail & Related papers (2022-04-13T08:16:03Z)
- Unsupervised Question Decomposition for Question Answering [102.56966847404287]
We propose an algorithm for one-to-N unsupervised sequence transduction (ONUS) that learns to map one hard, multi-hop question to many simpler, single-hop sub-questions.
We show large QA improvements on HotpotQA over a strong baseline on the original, out-of-domain, and multi-hop dev sets.
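For illustration, a hedged sketch of a decompose-then-answer pipeline in this spirit; both model calls are placeholders, not the trained ONUS system.

```python
# Sketch: decompose a multi-hop question into single-hop sub-questions,
# answer each, then answer the original question with the sub-answers as
# extra context. Both model calls are hypothetical placeholders.

def decompose(question: str) -> list[str]:
    raise NotImplementedError("e.g., a one-to-N question decomposition model")

def single_hop_qa(question: str) -> str:
    raise NotImplementedError("e.g., an off-the-shelf single-hop QA model")

def answer_multi_hop(question: str) -> str:
    sub_qas = [(sq, single_hop_qa(sq)) for sq in decompose(question)]
    context = "\n".join(f"Q: {sq} A: {sa}" for sq, sa in sub_qas)
    # The final reader sees the original question plus the sub-QA pairs.
    return single_hop_qa(f"{context}\nQ: {question}")
```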
arXiv Detail & Related papers (2020-02-22T19:40:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.