Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs
- URL: http://arxiv.org/abs/2305.14279v4
- Date: Fri, 2 Feb 2024 18:37:07 GMT
- Title: Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs
- Authors: Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen
Zhao, Samuel R. Bowman, Kyunghyun Cho
- Abstract summary: We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps.
We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency and compositional consistency.
We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
- Score: 78.31625291513589
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved widespread success on a variety of
in-context few-shot tasks, but this success is typically evaluated via
correctness rather than consistency. We argue that self-consistency is an
important criterion for valid multi-step reasoning in tasks where the solution
is composed of the answers to multiple sub-steps. We propose two types of
self-consistency that are particularly important for multi-step reasoning --
hypothetical consistency (a model's ability to predict what its output would be
in a hypothetical other context) and compositional consistency (consistency of
a model's final outputs when intermediate sub-steps are replaced with the
model's outputs for those steps). We demonstrate that multiple variants of the
GPT-3/-4 models exhibit poor consistency rates across both types of consistency
on a variety of tasks.
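
To make the two notions concrete, here is a minimal Python sketch of how such checks could be run. It is illustrative only: query_model is a hypothetical stand-in for whatever LLM API is in use, and the prompt templates and exact-match comparison are assumptions, not the paper's evaluation protocol.

    def query_model(prompt: str) -> str:
        """Hypothetical helper: send `prompt` to an LLM and return its completion."""
        raise NotImplementedError("plug in an LLM client here")

    def hypothetical_consistency(question: str) -> bool:
        """Does the model predict what its own output would be in another context?"""
        actual = query_model(question)
        predicted = query_model(
            "If you were asked the following question, what would you answer? "
            "Reply with the answer only.\n" + question
        )
        return actual.strip() == predicted.strip()

    def compositional_consistency(question: str,
                                  sub_questions: list[str],
                                  compose_template: str) -> bool:
        """Does the final answer survive substituting the model's own
        sub-step answers back into the problem?"""
        direct_answer = query_model(question)                  # one-shot answer
        sub_answers = [query_model(q) for q in sub_questions]  # answer each sub-step
        composed_answer = query_model(compose_template.format(*sub_answers))
        return direct_answer.strip() == composed_answer.strip()
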
Related papers
- MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs) [26.475993408532304] (2024-10-07)
We study the ability of an MLLM to produce semantically similar or identical responses to semantically similar queries.
We propose the MM-R$^3$ benchmark, which analyses SoTA MLLMs in terms of both consistency and accuracy.
Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa.
- Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? [65.43882564649721] (2024-06-18)
Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues.
We develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty.
We analyze the potential for improvement in consistency via a relative consistency score (see the sketch after this list).
- An Empirical Investigation into Benchmarking Model Multiplicity for Trustworthy Machine Learning: A Case Study on Image Classification [0.8702432681310401] (2023-11-24)
This paper offers a one-stop empirical benchmark of multiplicity across various dimensions of model design.
We also develop a framework, which we call multiplicity sheets, to benchmark multiplicity in various scenarios.
We show that multiplicity persists in deep learning models even after enforcing additional specifications during model selection.
- Self-Consistency of Large Language Models under Ambiguity [4.141513298907867] (2023-10-20)
This work presents an evaluation benchmark for self-consistency in cases of under-specification.
We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task.
We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency were random.
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357] (2023-10-04)
This paper introduces Adaptive Model Merging (AdaMerging), which autonomously learns the coefficients for model merging, either task-wise or layer-wise, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging improves performance by 11%.
- Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency [127.97467912117652] (2023-09-29)
Large language models (LLMs) have exhibited remarkable ability in code generation.
However, generating the correct solution in a single attempt remains a challenge.
We propose the Multi-Perspective Self-Consistency (MPSC) framework incorporating both inter- and intra-consistency.
- Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644] (2023-03-28)
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
- AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123] (2020-01-15)
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
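
For the ConsisEval entry above, here is a minimal sketch of one way a consistency score over (easy, hard) question pairs could be computed. Reading the score as a conditional solve rate is an assumption made for illustration; the benchmark's exact definition may differ.

    def consistency_score(results: list[tuple[bool, bool]]) -> float:
        """results[i] = (solved_easy, solved_hard) for the i-th question pair.
        Assumed reading: empirical probability of solving the easier question
        given that the harder one in the same pair was solved."""
        easy_given_hard = [easy for easy, hard in results if hard]
        if not easy_given_hard:
            return float("nan")  # undefined if no hard question was solved
        return sum(easy_given_hard) / len(easy_given_hard)

    # Example: the hard question is solved in 3 of 4 pairs, but the easy
    # counterpart is missed once among those 3, giving a score of 2/3.
    print(consistency_score([(True, True), (False, True), (True, True), (True, False)]))
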
This list is automatically generated from the titles and abstracts of the papers on this site.