It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning
- URL: http://arxiv.org/abs/2311.07532v3
- Date: Fri, 7 Jun 2024 23:01:20 GMT
- Title: It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning
- Authors: Nishant Balepur, Shramay Palta, Rachel Rudinger
- Abstract summary: Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored.
We propose process of elimination (PoE) with COT, where LLMs must reason toward incorrect options on multiple-choice questions.
We find that the strategy of PoE always underperforms the strategy of choosing the correct answer.
- Score: 16.626335975696243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored. This process of elimination (PoE), when used with COT, can enhance self-consistency, interpretability, and tasks such as medical diagnoses of exclusion. Thus, we propose PoE with COT, where LLMs must reason toward incorrect options on multiple-choice questions. We evaluate the ability of GPT-3.5, LLaMA-2, and Falcon to perform PoE with COT on a total of four commonsense and scientific reasoning datasets. We find that the strategy of PoE always underperforms the strategy of choosing the correct answer. The agreement of these strategies is also lower than the self-consistency of each strategy. To study these issues further, we conduct error analyses and give suggestions for future work.
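The two strategies can be pictured as complementary prompts over the same multiple-choice question. Below is a minimal sketch in Python; the query_llm helper, the prompt wording, and the answer parsing are illustrative assumptions, not the paper's released prompts.

```python
# Sketch of the two strategies compared in the paper: (1) reason toward
# the correct answer, (2) process of elimination (PoE), where the model
# must instead reason toward the incorrect options.
# `query_llm` is a hypothetical stand-in for any chat-completion API.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in GPT-3.5, LLaMA-2, Falcon, etc.")

def format_question(question: str, options: dict[str, str]) -> str:
    lines = [question] + [f"({k}) {v}" for k, v in sorted(options.items())]
    return "\n".join(lines)

def choose_correct(question: str, options: dict[str, str]) -> str:
    """Standard chain-of-thought: reason toward the one correct option."""
    prompt = (
        format_question(question, options)
        + "\nThink step by step, then answer with the letter of the correct option."
    )
    response = query_llm(prompt)
    letters = [c for c in response if c in options]
    return letters[-1] if letters else ""  # crude parse: last option letter mentioned

def process_of_elimination(question: str, options: dict[str, str]) -> str:
    """PoE with COT: reason toward the incorrect options; the answer is
    whatever single option is left uneliminated."""
    prompt = (
        format_question(question, options)
        + "\nThink step by step about why each option is incorrect, then"
        + " list the letters of all INCORRECT options."
    )
    eliminated = {c for c in query_llm(prompt) if c in options}
    remaining = [k for k in options if k not in eliminated]
    return remaining[0] if len(remaining) == 1 else ""  # "" = no usable answer
```

Agreement between the strategies is then simply whether the two functions settle on the same letter for a given question, while self-consistency compares repeated samples of one strategy against itself.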
Related papers
- Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent [9.439315294704368]
Tree of Thoughts (ToT) methods have shown potential in improving reasoning for complex question-answering tasks.
A critical limitation in multi-agent reasoning is the 'Reasoner' agent's shallow exploration of reasoning paths.
We introduce a novel approach combining ToT-based Reasoner agents with a Thought Validator agent.
Our method demonstrates superior performance compared to existing techniques when evaluated on the GSM8K dataset.
arXiv Detail & Related papers (2024-09-17T19:54:37Z)
- Mitigating Misleading Chain-of-Thought Reasoning with Selective Filtering [59.495717939664246]
Large language models have manifested remarkable capabilities by leveraging chain-of-thought (CoT) reasoning techniques to solve intricate questions.
We propose a novel approach called the selective filtering reasoner (SelF-Reasoner) that assesses the entailment relationship between the question and the candidate reasoning chain.
SelF-Reasoner improves the fine-tuned T5 baseline consistently over the ScienceQA, ECQA, and LastLetter tasks.
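A rough approximation of this filtering step can be built from an off-the-shelf NLI model: discard a candidate chain when the model judges it to contradict the question. A minimal sketch, assuming the public roberta-large-mnli checkpoint and its CONTRADICTION/NEUTRAL/ENTAILMENT labels; this is a stand-in, not the authors' trained filter.

```python
# Entailment-based filtering of candidate reasoning chains, in the
# spirit of SelF-Reasoner. The NLI checkpoint and the decision rule
# here are assumptions, not the paper's trained components.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def keep_chain(question: str, chain: str, threshold: float = 0.5) -> bool:
    """Keep the chain unless the NLI model confidently says it
    contradicts the question."""
    result = nli({"text": question, "text_pair": chain})[0]
    return not (result["label"] == "CONTRADICTION" and result["score"] > threshold)
```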
arXiv Detail & Related papers (2024-03-28T06:28:35Z)
- POE: Process of Elimination for Multiple Choice Reasoning [19.65826015840337]
We argue that a similar two-step strategy can make LMs better at multiple-choice reasoning tasks.
In the first step, POE scores each option and eliminates seemingly wrong options.
In the second step, POE masks these wrong options and makes the final prediction from the remaining options.
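A compact way to picture the two steps, assuming a hypothetical score_option function that returns a language-model score for an option in context (the paper's actual scoring, elimination threshold, and masking prompts may differ):

```python
# Sketch of the two-step POE strategy: score and eliminate, then mask
# and predict. `score_option` is a hypothetical LM scoring function.

def score_option(context: str, option: str) -> float:
    raise NotImplementedError("e.g. LM log-likelihood of the option given the context")

def poe_predict(question: str, options: dict[str, str]) -> str:
    # Step 1: score each option and eliminate seemingly wrong ones
    # (here, a simple heuristic: drop options scoring below the mean).
    scores = {k: score_option(question, v) for k, v in options.items()}
    mean_score = sum(scores.values()) / len(scores)
    remaining = [k for k in options if scores[k] >= mean_score]

    # Step 2: show the question with eliminated options masked out,
    # then pick the best-scoring option among those that remain.
    shown = {k: (options[k] if k in remaining else "[MASK]") for k in options}
    masked = question + "\n" + "\n".join(f"({k}) {v}" for k, v in shown.items())
    final = {k: score_option(masked, options[k]) for k in remaining}
    return max(final, key=final.get)
```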
arXiv Detail & Related papers (2023-10-24T07:38:43Z)
- Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games [14.063311955315077]
Large language models (LLMs) are effective at answering questions that are clearly asked.
When faced with ambiguous queries, they can act unpredictably and produce incorrect outputs.
This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively.
arXiv Detail & Related papers (2023-10-02T16:55:37Z)
- Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion.
Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities.
We introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in a "tit for tat" fashion and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs, which is helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models [52.31950122881687]
We introduce a new framework for language model inference, Tree of Thoughts (ToT).
ToT generalizes over the popular Chain of Thought approach to prompting language models.
Our experiments show that ToT significantly enhances language models' problem-solving abilities.
arXiv Detail & Related papers (2023-05-17T23:16:17Z)
- Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which makes them highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
- Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies [78.68534915690404]
StrategyQA is a benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy.
We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts.
Overall, StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs.
arXiv Detail & Related papers (2021-01-06T19:14:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.