Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models
- URL: http://arxiv.org/abs/2505.10543v1
- Date: Thu, 15 May 2025 17:53:47 GMT
- Title: Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models
- Authors: Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein, Anna V. Kononova
- Abstract summary: This study systematically evaluates the efficacy of self-reflection, mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments. We find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap.
- Score: 0.017476232824732776
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While large language models demonstrate impressive performance on static benchmarks, their true potential as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates the efficacy of self-reflection, heuristic mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments and find, first, that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an overly long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to large performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in crucial areas such as planning, reasoning, and spatial coordination, suggesting that current-generation large language models still suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while reasoning methods like chain of thought improve multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.
Related papers
- Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement [101.77467538102924]
Large reasoning models (LRMs) exhibit overthinking, which hinders efficiency and inflates inference cost. We propose two lightweight methods to enhance LRM efficiency. First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction. Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity.
arXiv Detail & Related papers (2025-06-18T17:18:12Z)
- Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement [43.532921045069365]
Large Language Models (LLMs) continue to set new standards in knowledge-intensive and complex reasoning tasks. Current techniques, such as static knowledge distillation, resource-intensive reinforcement learning from human feedback, or limited self-reflection, struggle to yield substantial and lasting performance gains. We present a novel Debate and Reflect (D&R) framework that orchestrates multi-turn debates between smaller models and stronger teacher models, eliciting actionable feedback.
arXiv Detail & Related papers (2025-06-04T03:52:20Z)
- Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability [70.4107059502882]
Training language models with rationales augmentation has been shown to be beneficial in many existing works. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance.
arXiv Detail & Related papers (2025-05-30T02:39:37Z)
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees [3.4289478404209826]
Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions.
arXiv Detail & Related papers (2024-10-21T08:21:00Z)
- On the Modeling Capabilities of Large Language Models for Sequential Decision Making [52.128546842746246]
Large pretrained models are showing increasingly better performance in reasoning and planning tasks.
We evaluate their ability to produce decision-making policies, either directly, by generating actions, or indirectly.
In environments with unfamiliar dynamics, we explore how fine-tuning LLMs with synthetic data can significantly improve their reward modeling capabilities.
arXiv Detail & Related papers (2024-10-08T03:12:57Z)
- Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning [53.45295657891099]
This paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework.
It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models.
Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity.
arXiv Detail & Related papers (2024-10-04T11:18:41Z)
- Eureka: Evaluating and Understanding Large Foundation Models [23.020996995362104]
We present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings.
We conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison.
arXiv Detail & Related papers (2024-09-13T18:01:49Z)
- E-Bench: Towards Evaluating the Ease-of-Use of Large Language Models [29.763745375790933]
Large language models (LLMs) are sensitive to prompts, and another synonymous expression or a typo may lead to unexpected results for the model.
We evaluate the ease-of-use of LLMs and construct E-Bench, simulating the actual situation of human use.
arXiv Detail & Related papers (2024-06-16T14:08:30Z)
- Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.532180752491954]
Large Language Models (LLMs) are often described as instances of foundation models that possess strong generalization obeying scaling laws. We demonstrate here a dramatic breakdown of generalization and basic reasoning of all SOTA models claiming strong function. We also observe strong overconfidence in the wrong solutions, expressed in the form of plausible-sounding, explanation-like confabulations.
arXiv Detail & Related papers (2024-06-04T07:43:33Z)
- Improving Factuality and Reasoning in Language Models through Multiagent Debate [95.10641301155232]
We present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer.
Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks.
Our approach may be directly applied to existing black-box models and uses identical procedure and prompts for all tasks we investigate.
arXiv Detail & Related papers (2023-05-23T17:55:11Z)
- Emergent Abilities of Large Language Models [172.08007363384218]
We consider an ability to be emergent if it is not present in smaller models but is present in larger models.
The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
arXiv Detail & Related papers (2022-06-15T17:32:01Z)
- Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought.
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z)