Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions
- URL: http://arxiv.org/abs/2502.18435v2
- Date: Thu, 20 Mar 2025 03:25:21 GMT
- Title: Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions
- Authors: Yizhe Zhang, Richard Bai, Zijin Gu, Ruixiang Zhang, Jiatao Gu, Emmanuel Abbe, Samy Bengio, Navdeep Jaitly
- Abstract summary: Language models usually use left-to-right (L2R) autoregressive factorization. We investigate whether alternative factorizations of the text distribution could be beneficial in some tasks.
- Score: 51.61404787000037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models usually use left-to-right (L2R) autoregressive factorization. However, L2R factorization may not always be the best inductive bias. Therefore, we investigate whether alternative factorizations of the text distribution could be beneficial in some tasks. We investigate right-to-left (R2L) training as a compelling alternative, focusing on multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across various model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors including calibration, computability and directional conditional entropy. We ablate the impact of these factors through controlled simulation studies using arithmetic tasks, where the impacting factors can be better disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can lead to improvements in LLM capabilities and provides theoretical insights into optimal factorization towards approximating human language distribution, and when each reasoning order might be more advantageous.
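As a rough illustration of the two factorizations named in the abstract, the sketch below scores multiple-choice options with a forward (L2R) and a reversed (R2L) toy model. It assumes a count-based bigram model in place of a 2B-8B language model, and it ranks options by the joint log-probability of question plus option, which is a simplification and not necessarily the scoring scheme used in the paper. The names `train_bigram`, `pick_option`, and the arithmetic corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Fit a count-based bigram model with add-one smoothing; return a log-prob scorer."""
    vocab = {tok for seq in sequences for tok in seq}
    counts = defaultdict(Counter)
    for seq in sequences:
        prev = "<s>"  # start-of-sequence marker
        for tok in seq:
            counts[prev][tok] += 1
            prev = tok

    def log_prob(seq):
        # Sum of log p(token | previous token) in the order the tokens are given.
        total, prev = 0.0, "<s>"
        for tok in seq:
            seen = counts.get(prev, Counter())
            total += math.log((seen[tok] + 1) / (sum(seen.values()) + len(vocab)))
            prev = tok
        return total

    return log_prob

# Toy corpus (assumption): three arithmetic statements, tokenized on whitespace.
corpus = [s.split() for s in ("2 + 3 = 5", "1 + 4 = 5", "2 + 2 = 4")]

score_l2r = train_bigram(corpus)                          # reads text left to right
score_r2l = train_bigram([seq[::-1] for seq in corpus])   # reads text right to left

def pick_option(question, options, direction="l2r"):
    """Return the index of the option whose full sequence scores highest."""
    scores = []
    for opt in options:
        full = question + opt
        if direction == "l2r":
            scores.append(score_l2r(full))
        else:
            scores.append(score_r2l(full[::-1]))  # reverse tokens for the R2L model
    return max(range(len(options)), key=scores.__getitem__)

question = "2 + 3 =".split()
options = [["5"], ["4"]]
print(pick_option(question, options, "l2r"), pick_option(question, options, "r2l"))
```

Both directions factorize the same joint distribution, so any difference in the chosen option comes from how well each learned model approximates its own conditionals.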
Related papers
- Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? [84.32752330104775]
Variation in human annotation (i.e., disagreements) is common in NLP. We evaluate the influence of different reasoning settings on Large Language Model disagreement modeling. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling.
arXiv Detail & Related papers (2025-06-24T09:49:26Z)
- Thinkless: LLM Learns When to Think [57.857534644932194]
Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. We propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning. On several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% - 90%.
arXiv Detail & Related papers (2025-05-19T17:24:16Z)
- Systematic Bias in Large Language Models: Discrepant Response Patterns in Binary vs. Continuous Judgment Tasks [13.704342633541454]
Large Language Models (LLMs) are increasingly used in tasks such as psychological text analysis and decision-making in automated systems.
This study examines how different response formats, binary versus continuous, may systematically influence LLMs' judgments.
arXiv Detail & Related papers (2025-04-28T03:20:55Z)
- Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition [11.422434149376478]
Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities.
In theory, autoregressive LLMs with Chain-of-Thought (CoT) can perform more serial computations to solve complex reasoning tasks.
Recent studies suggest that, despite this capacity, LLMs do not truly learn to reason but instead fit on statistical features.
arXiv Detail & Related papers (2025-04-04T20:57:36Z)
- A Survey of Scaling in Large Language Model Reasoning [62.92861523305361]
We provide a comprehensive examination of scaling in large language model (LLM) reasoning.
We analyze scaling in reasoning steps that improves multi-step inference and logical consistency.
We discuss scaling in training-enabled reasoning, focusing on optimization through iterative model improvement.
arXiv Detail & Related papers (2025-04-02T23:51:27Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.
It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.
Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT).
Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- It Helps to Take a Second Opinion: Teaching Smaller LLMs to Deliberate Mutually via Selective Rationale Optimisation [20.784944581469205]
COALITION is a trainable framework that facilitates interaction between two variants of the same SLM. It trains them to generate and refine rationales optimized for the end-task. Our ablation studies reveal that cross-communication between the two variants performs better than using the single model to self-refine the rationales.
arXiv Detail & Related papers (2025-03-04T10:17:29Z)
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
- CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models [5.409370027524351]
We evaluate the performance of large language models (LLMs) in counterfactual reasoning. We introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions.
arXiv Detail & Related papers (2025-02-16T06:19:37Z)
- Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs [11.805264893752154]
We evaluate the reasoning capabilities of two large language models, LLaMA 2 and Flan-T5, by manipulating facts within two compositional datasets: QASC and Bamboogle.
Our findings reveal that while both models leverage (a), Flan-T5 shows more resilience to experiments, having less variance than LLaMA 2.
This suggests that models may develop an understanding of transitivity through fine-tuning on knowingly relevant datasets.
arXiv Detail & Related papers (2024-10-26T15:09:07Z)
- Uncovering Factor Level Preferences to Improve Human-Model Alignment [58.50191593880829]
We introduce PROFILE, a framework that uncovers and quantifies the influence of specific factors driving preferences.
PROFILE's factor-level analysis explains the 'why' behind human-model alignment and misalignment.
We demonstrate how leveraging factor level insights, including addressing misaligned factors, can improve alignment with human preferences.
arXiv Detail & Related papers (2024-10-09T15:02:34Z)
- Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension [9.67774998354062]
Previous research has primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation. We propose a Premise-Oriented Data Augmentation (PODA) framework to generate CoT rationales including analyses for both correct and incorrect options. We also introduce a novel thought-path contrastive learning method that compares reasoning paths between the original and counterfactual samples.
arXiv Detail & Related papers (2024-09-22T15:44:43Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- How Likely Do LLMs with CoT Mimic Human Reasoning? [31.86489714330338]
Chain-of-thought emerges as a promising technique for eliciting reasoning capabilities from Large Language Models (LLMs). We use causal analysis to understand the relationships between the problem instruction, reasoning, and the answer in LLMs.
arXiv Detail & Related papers (2024-02-25T10:13:04Z)
- Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers.
We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z)
- Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs).
We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models perform close to random on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
- IRRGN: An Implicit Relational Reasoning Graph Network for Multi-turn Response Selection [4.471148909362883]
The Implicit Relational Reasoning Graph Network aims to implicitly extract relations between utterances, as well as between utterances and options.
The model surpasses human performance for the first time on the MuTual dataset.
arXiv Detail & Related papers (2022-12-01T13:17:25Z)