Evaluating the Meta- and Object-Level Reasoning of Large Language Models for Question Answering
- URL: http://arxiv.org/abs/2502.10338v1
- Date: Fri, 14 Feb 2025 17:55:43 GMT
- Title: Evaluating the Meta- and Object-Level Reasoning of Large Language Models for Question Answering
- Authors: Nick Ferguson, Liane Guillou, Alan Bundy, Kwabena Nuamah
- Abstract summary: Large Language Models (LLMs) excel in natural language tasks but face challenges in Question Answering (QA) tasks requiring complex, multi-step reasoning. We outline the types of reasoning required in some of these tasks, and reframe them in terms of meta-level reasoning.
- Score: 1.9285000127136378
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) excel in natural language tasks but still face challenges in Question Answering (QA) tasks requiring complex, multi-step reasoning. We outline the types of reasoning required in some of these tasks, and reframe them in terms of meta-level reasoning (akin to high-level strategic reasoning or planning) and object-level reasoning (embodied in lower-level tasks such as mathematical reasoning). Franklin, a novel dataset with requirements of meta- and object-level reasoning, is introduced and used along with three other datasets to evaluate four LLMs at question answering tasks requiring multiple steps of reasoning. Results from human annotation studies suggest LLMs demonstrate meta-level reasoning with high frequency, but struggle with object-level reasoning tasks in some of the datasets used. Additionally, evidence suggests that LLMs find the object-level reasoning required for the questions in the Franklin dataset challenging, yet they do exhibit strong performance with respect to the meta-level reasoning requirements.
Related papers
- Exploring the Meta-level Reasoning of Large Language Models via a Tool-based Multi-hop Tabular Question Answering Task [0.0]
We take a more structured approach, distinguishing meta-level reasoning from object-level reasoning. We design a novel question answering task based around the values of geopolitical indicators for various countries over various years. We find that LLMs demonstrate good meta-level reasoning on our task, yet are flawed in some aspects of task understanding.
arXiv Detail & Related papers (2026-01-12T16:29:21Z) - MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI [59.196131618912005]
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs). Existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities. We introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability.
arXiv Detail & Related papers (2025-06-30T07:14:38Z) - Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [75.26829371493189]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z) - QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? [17.854829489402324]
In the real world, queries to large language models are often underspecified, only solvable through acquiring missing information.
We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question.
arXiv Detail & Related papers (2025-03-28T17:58:40Z) - Advancing Reasoning in Large Language Models: Promising Methods and Approaches [0.0]
Large Language Models (LLMs) have succeeded remarkably in various natural language processing (NLP) tasks. Their ability to perform complex reasoning (spanning logical deduction, mathematical problem-solving, commonsense inference, and multi-step reasoning) often falls short of human expectations. This survey provides a comprehensive review of emerging techniques enhancing reasoning in LLMs.
arXiv Detail & Related papers (2025-02-05T23:31:39Z) - Reasoning with Graphs: Structuring Implicit Knowledge to Enhance LLMs Reasoning [73.2950349728376]
Large language models (LLMs) have demonstrated remarkable success across a wide range of tasks. However, they still encounter challenges in reasoning tasks that require understanding and inferring relationships between pieces of information. This challenge is particularly pronounced in tasks involving multi-step processes, such as logical reasoning and multi-hop question answering. We propose Reasoning with Graphs (RwG) by first constructing explicit graphs from the context.
arXiv Detail & Related papers (2025-01-14T05:18:20Z) - CLR-Bench: Evaluating Large Language Models in College-level Reasoning [17.081788240112417]
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks.
We present CLR-Bench to comprehensively evaluate the LLMs in complex college-level reasoning.
arXiv Detail & Related papers (2024-10-23T04:55:08Z) - Reasoning with Large Language Models, a Survey [2.831296564800826]
This paper reviews the rapidly expanding field of prompt-based reasoning with LLMs.
Our taxonomy identifies different ways to generate, evaluate, and control multi-step reasoning.
We find that self-improvement, self-reflection, and some meta abilities of the reasoning processes are possible through the judicious use of prompts.
arXiv Detail & Related papers (2024-07-16T08:49:35Z) - LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But, can they really "reason" over the natural language?
This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z) - Exploring the Potential of Large Language Models in Computational Argumentation [54.85665903448207]
Large language models (LLMs) have demonstrated impressive capabilities in understanding context and generating natural language.
This work aims to embark on an assessment of LLMs, such as ChatGPT, Flan models, and LLaMA2 models, in both zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-11-15T15:12:15Z) - DetermLR: Augmenting LLM-based Logical Reasoning from Indeterminacy to Determinacy [76.58614128865652]
We propose DetermLR, a novel perspective that rethinks the reasoning process as an evolution from indeterminacy to determinacy.
First, we categorize known conditions into two types: determinate and indeterminate premises. This provides an overall direction for the reasoning process and guides LLMs in converting indeterminate data into progressively determinate insights.
We automate the storage and extraction of available premises and reasoning paths with reasoning memory, preserving historical reasoning details for subsequent reasoning steps.
arXiv Detail & Related papers (2023-10-28T10:05:51Z) - MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z) - Improving Large Language Models in Event Relation Logical Prediction [33.88499005859982]
Event relation extraction is a challenging task that demands thorough semantic understanding and rigorous logical reasoning.
In this paper, we conduct an in-depth investigation to systematically explore the capability of LLMs in understanding and applying event relation logic.
Our study reveals that LLMs are not logically consistent reasoners, which results in their suboptimal performance on tasks that need rigorous reasoning.
arXiv Detail & Related papers (2023-10-13T14:53:06Z) - Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning techniques to assess model performance.
arXiv Detail & Related papers (2023-10-02T01:00:50Z) - ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge in modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.