Multidimensional Consistency Improves Reasoning in Language Models
- URL: http://arxiv.org/abs/2503.02670v1
- Date: Tue, 04 Mar 2025 14:41:05 GMT
- Title: Multidimensional Consistency Improves Reasoning in Language Models
- Authors: Huiyuan Lai, Xiao Zhang, Malvina Nissim
- Abstract summary: We introduce a framework for testing models for answer consistency across multiple input variations. We induce variations in (i) the order of shots in the prompt, (ii) problem phrasing, and (iii) the languages used. Our framework consistently enhances mathematical reasoning performance on both the monolingual dataset GSM8K and the multilingual dataset MGSM, especially for smaller models.
- Score: 21.989335720239467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) have proved capable of addressing some complex reasoning tasks, we also know that they are highly sensitive to input variation, which can lead to different solution paths and final answers. Answer consistency across input variations can thus be taken as a sign of stronger confidence. Leveraging this insight, we introduce a framework, *Multidimensional Reasoning Consistency*, where, focusing on math problems, models are systematically pushed to diversify solution paths towards a final answer, thereby testing them for answer consistency across multiple input variations. We induce variations in (i) the order of shots in the prompt, (ii) problem phrasing, and (iii) the languages used. Extensive experiments on a large range of open-source state-of-the-art LLMs of various sizes show that reasoning consistency differs by variation dimension, and that by aggregating consistency across dimensions, our framework consistently enhances mathematical reasoning performance on both the monolingual dataset GSM8K and the multilingual dataset MGSM, especially for smaller models.
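The abstract leaves the aggregation rule implicit, but a natural reading is a majority vote over the final answers produced under all three variation dimensions. Below is a minimal Python sketch of that reading; `generate_answer`, the cap on shot permutations, and the plain modal vote are illustrative assumptions, not the authors' released implementation.

```python
from collections import Counter
from itertools import permutations

# Hypothetical stand-in for an LLM call; not from the paper's code.
def generate_answer(problem: str, shots: list[str], language: str) -> str:
    """Query a model few-shot and return its final numeric answer as a string."""
    raise NotImplementedError("plug in a concrete model call here")

def multidimensional_consistency(problem: str,
                                 paraphrases: list[str],
                                 shots: list[str],
                                 translations: list[tuple[str, str]]) -> str:
    """Majority-vote an answer across three variation dimensions.

    Mirrors the abstract's idea: vary (i) shot order, (ii) problem phrasing,
    and (iii) language, then take the most frequent final answer as the
    highest-confidence one.
    """
    answers: list[str] = []
    for order in list(permutations(shots))[:4]:      # (i) shot order, capped for cost
        answers.append(generate_answer(problem, list(order), "en"))
    for variant in paraphrases:                      # (ii) problem phrasing
        answers.append(generate_answer(variant, shots, "en"))
    for translated, lang in translations:            # (iii) language
        answers.append(generate_answer(translated, shots, lang))
    return Counter(answers).most_common(1)[0][0]     # modal answer wins
```

The design intuition is that all variations target the same underlying problem, so an answer that survives reordering, rephrasing, and translation is treated as the most confident one.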
Related papers
- Enhancing Multi-hop Reasoning in Vision-Language Models via Self-Distillation with Multi-Prompt Ensembling [11.462550020102935]
Multi-modal large language models can leverage chain-of-thought prompting for zero- or few-shot learning. Similar prompting strategies are less effective for multi-modal LLMs due to modality gaps and task complexity. We propose a self-distillation framework such that the model can improve itself without any annotated data.
arXiv Detail & Related papers (2025-03-03T17:24:42Z)
- Bactrainus: Optimizing Large Language Models for Multi-hop Complex Question Answering Tasks [5.439505575097552]
We evaluate the ability of large language models to perform domain-specific tasks using the HotpotQA dataset. This task serves as a challenging benchmark for assessing the language comprehension capabilities of these models. The results of the study show that integrating large language models with these techniques can lead to up to a 4% improvement in F1 score for finding answers.
arXiv Detail & Related papers (2025-01-10T18:44:06Z)
- Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs).
We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs.
We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
- HEMM: Holistic Evaluation of Multimodal Foundation Models [91.60364024897653]
Multimodal foundation models can holistically process text alongside images, video, audio, and other sensory modalities.
It is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains.
arXiv Detail & Related papers (2024-07-03T18:00:48Z)
- Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? [65.43882564649721]
Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues.
We develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty.
We analyze the potential for improvement in consistency via a relative consistency score.
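As a rough illustration of the kind of metric such a benchmark enables (the definition below is an assumed reading, not ConsisEval's published formula), hard-to-easy consistency can be estimated as the probability of solving the easy member of a pair given that the hard member was solved:

```python
def consistency_score(results: list[tuple[bool, bool]]) -> float:
    """Estimate P(easy solved | hard solved) over (easy_ok, hard_ok) pairs.

    Illustrative reading of a hard-to-easy consistency metric; not the
    exact formula from the ConsisEval paper.
    """
    easy_given_hard = [easy_ok for easy_ok, hard_ok in results if hard_ok]
    if not easy_given_hard:
        return 0.0
    return sum(easy_given_hard) / len(easy_given_hard)

# Example: the hard question was solved in 3 of 4 pairs; the easy one was
# also solved in 2 of those 3 cases, so the score is 2/3.
print(consistency_score([(True, True), (False, True), (True, True), (True, False)]))
```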
arXiv Detail & Related papers (2024-06-18T17:25:47Z)
- Brainstorming Brings Power to Large Language Models of Knowledge Reasoning [17.14501985068287]
Large Language Models (LLMs) have demonstrated amazing capabilities in language generation, text comprehension, and knowledge reasoning.
Recent studies have further improved models' reasoning ability on a wide range of tasks by introducing multi-model collaboration.
We propose prompt-based multi-model brainstorming: different models are incorporated into a group for brainstorming, and after multiple rounds of reasoning elaboration and re-inference, a consensus answer is reached.
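A schematic of that consensus loop might look like the following; the `model_step` interface, the stopping rule, and the majority-vote fallback are illustrative assumptions rather than the paper's implementation.

```python
from collections import Counter

# Hypothetical model interface; not from the paper's code.
def model_step(model: str, question: str, peer_outputs: list[str]) -> str:
    """Ask one model to answer, conditioned on the other models' last outputs."""
    raise NotImplementedError("plug in a concrete model call here")

def brainstorm(models: list[str], question: str, max_rounds: int = 3) -> str:
    """Run rounds of cross-model re-inference until the group agrees."""
    outputs: list[str] = []
    for _ in range(max_rounds):
        outputs = [model_step(m, question, outputs) for m in models]
        if len(set(outputs)) == 1:                    # consensus reached
            return outputs[0]
    return Counter(outputs).most_common(1)[0][0]      # fallback: majority answer
```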
arXiv Detail & Related papers (2024-06-02T14:47:14Z)
- Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm: composing existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
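One simple reading of "merging LLM parameters" is element-wise weight averaging across the composed models; the sketch below assumes that reading (and matching float-tensor state dicts), which may differ from NaiveMC's actual merge rule.

```python
import torch

def merge_llm_parameters(state_dicts: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Average matching LLM weights; an illustrative reading of parameter merging.

    Assumes every state dict has the same keys and float tensors of the
    same shape, e.g. several fine-tuned variants of one base LLM.
    """
    merged: dict[str, torch.Tensor] = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return merged
```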
arXiv Detail & Related papers (2024-02-20T06:38:10Z)
- On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z)
- Recursion of Thought: A Divide-and-Conquer Approach to Multi-Context Reasoning with Language Models [58.41943058963672]
We propose a new inference framework called Recursion of Thought (RoT).
RoT introduces several special tokens that the models can output to trigger context-related operations.
Experiments with multiple architectures, including GPT-3, show that RoT dramatically improves LMs' ability to solve complex problems.
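The mechanism described above, special output tokens that trigger context-related operations, can be sketched as a recursive controller around the model; the token names and `lm_generate` interface here are assumptions, not RoT's exact vocabulary.

```python
# Hypothetical LM interface; token names are illustrative, not RoT's actual tokens.
def lm_generate(context: str) -> str:
    """Generate a continuation that may contain control tokens."""
    raise NotImplementedError("plug in a concrete model call here")

def solve(problem: str) -> str:
    """Divide-and-conquer inference: recurse whenever the model emits a sub-question."""
    context = problem
    while True:
        output = lm_generate(context)
        if "[SUB]" in output:                        # model asks to spawn a sub-context
            sub_problem = output.split("[SUB]", 1)[1].split("[/SUB]", 1)[0]
            sub_answer = solve(sub_problem)          # recurse in a fresh context
            context += output + f" [ANS]{sub_answer}[/ANS]"  # splice the answer back
        else:
            return output                            # base case: final answer
```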
arXiv Detail & Related papers (2023-06-12T06:34:16Z)
- Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs [78.31625291513589]
We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps.
We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency and compositional consistency.
We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates of both types across a variety of tasks.
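As a rough illustration of what checking one of these properties involves (the `ask` helper and prompt template are placeholders, not the paper's evaluation harness), compositional consistency asks whether substituting the model's own sub-answer into a composed question leaves the final answer unchanged:

```python
# Placeholder for a model call; not the paper's evaluation code.
def ask(question: str) -> str:
    raise NotImplementedError("plug in a concrete model call here")

def compositionally_consistent(sub_question: str, template: str) -> bool:
    """Check one compositional-consistency case.

    `template` contains a {sub} slot, e.g. "What is {sub} plus 10?".
    The model should answer the composed question the same way whether the
    slot references the sub-question or holds the model's own answer to it.
    """
    sub_answer = ask(sub_question)
    composed = ask(template.format(sub=f"the answer to '{sub_question}'"))
    direct = ask(template.format(sub=sub_answer))
    return composed == direct
```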
arXiv Detail & Related papers (2023-05-23T17:25:59Z)
- A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models [81.15974174627785]
We study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space.
Our analysis shows that robustness does not appear to continuously improve as a function of size, but the GPT-3 Davinci models (175B) achieve a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.
arXiv Detail & Related papers (2022-10-21T15:12:37Z)
- Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension [21.000045864213327]
Referring expression comprehension (REC) generally requires a large amount of multi-grained information from visual and linguistic modalities to realize accurate reasoning.
How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is crucial in the REC task.
We propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves the language-to-vision localization ability.
arXiv Detail & Related papers (2022-04-21T08:32:47Z)
- Semantic Sentence Composition Reasoning for Multi-Hop Question Answering [1.773120658816994]
We present a semantic sentence composition reasoning approach for a multi-hop question answering task.
With the combination of factual sentences and multi-stage semantic retrieval, our approach can provide more comprehensive contextual information for model training and reasoning.
Experimental results demonstrate that our model can incorporate existing pre-trained language models and outperforms the existing SOTA method on the QASC task by about 9%.
arXiv Detail & Related papers (2022-03-01T00:35:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.