Related papers: Investigating Symbolic Capabilities of Large Language Models

Investigating Symbolic Capabilities of Large Language Models

URL: http://arxiv.org/abs/2405.13209v1
Date: Tue, 21 May 2024 21:24:34 GMT
Title: Investigating Symbolic Capabilities of Large Language Models
Authors: Neisarg Dave, Daniel Kifer, C. Lee Giles, Ankur Mali,
Abstract summary: This study aims to bridge the gap by rigorously evaluating Large Language Models (LLMs) on a series of symbolic tasks. Our analysis encompasses eight LLMs, including four enterprise-grade and four open-source models, of which three have been pre-trained on mathematical tasks. The findings reveal a significant decline in LLMs' performance on context-free and context-sensitive symbolic tasks as the complexity, represented by the number of symbols, increases.
Score: 16.88906206735967
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prompting techniques have significantly enhanced the capabilities of Large Language Models (LLMs) across various complex tasks, including reasoning, planning, and solving math word problems. However, most research has predominantly focused on language-based reasoning and word problems, often overlooking the potential of LLMs in handling symbol-based calculations and reasoning. This study aims to bridge this gap by rigorously evaluating LLMs on a series of symbolic tasks, such as addition, multiplication, modulus arithmetic, numerical precision, and symbolic counting. Our analysis encompasses eight LLMs, including four enterprise-grade and four open-source models, of which three have been pre-trained on mathematical tasks. The assessment framework is anchored in Chomsky's Hierarchy, providing a robust measure of the computational abilities of these models. The evaluation employs minimally explained prompts alongside the zero-shot Chain of Thoughts technique, allowing models to navigate the solution process autonomously. The findings reveal a significant decline in LLMs' performance on context-free and context-sensitive symbolic tasks as the complexity, represented by the number of symbols, increases. Notably, even the fine-tuned GPT3.5 exhibits only marginal improvements, mirroring the performance trends observed in other models. Across the board, all models demonstrated a limited generalization ability on these symbol-intensive tasks. This research underscores LLMs' challenges with increasing symbolic complexity and highlights the need for specialized training, memory and architectural adjustments to enhance their proficiency in symbol-based reasoning tasks.

Related papers

When Do Symbolic Solvers Enhance Reasoning in Large Language Models? [9.02964480989444]
Large Reasoning Models (LRMs) achieve strong performance on complex reasoning tasks by generating long Chains of Thought (CoTs)<n>A promising direction is the symbolic-solver-integrated approach, which leverages the code generation capabilities of LLMs to translate reasoning tasks into executable code and then solve them with a symbolic solver.<n>Our experimental results show that the symbolic-solver-integrated method only helps when the problem requires limited implicit reasoning but involves an ample search space.
arXiv Detail & Related papers (2025-12-02T22:23:26Z)
mSCoRe: a $M$ultilingual and Scalable Benchmark for $S$kill-based $Co$mmonsense $Re$asoning [74.97363626515236]
We propose a textbfMultilingual and Scalable Benchmark for textbfSkill-based textbfCommonsense textbfReasoning (textbfmSCoRe)<n>Our benchmark incorporates three key components that are designed to systematically evaluate LLM's reasoning capabilities.<n>Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense.
arXiv Detail & Related papers (2025-08-13T18:59:02Z)
MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning [57.42710816140401]
A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps.<n>Existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM's ability to perform accurate visual operations via code largely unexplored.<n>This work takes a first step toward addressing that gap by evaluating MLLM's code-based capabilities in multi-modal mathematical reasoning.
arXiv Detail & Related papers (2025-07-24T07:03:11Z)
Worst-Case Symbolic Constraints Analysis and Generalisation with Large Language Models [11.612762531670212]
Large language models (LLMs) have been successfully applied to a variety of coding tasks, including code generation, completion, and repair.<n>This paper investigates the capacity of LLMs to reason about worst-case executions in programs through symbolic constraints analysis.
arXiv Detail & Related papers (2025-06-09T19:33:30Z)
Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark [0.0]
The Frame Problem and the Symbol Grounding Problem have historically been viewed as unsolvable within traditional symbolic AI systems.<n>This study investigates whether modern LLMs possess the cognitive capacities required to address these problems.
arXiv Detail & Related papers (2025-06-09T16:12:47Z)
Computational Thinking Reasoning in Large Language Models [69.28428524878885]
Computational Thinking Model (CTM) is a novel framework that incorporates computational thinking paradigms into large language models (LLMs)<n>Live code execution is seamlessly integrated into the reasoning process, allowing CTM to think by computing.<n>CTM outperforms conventional reasoning models and tool-augmented baselines in terms of accuracy, interpretability, and generalizability.
arXiv Detail & Related papers (2025-06-03T09:11:15Z)
Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning [50.53703102032562]
Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks.<n>The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood.
arXiv Detail & Related papers (2025-05-16T08:50:42Z)
Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics [15.695635219034328]
We conduct an in-depth analysis of code-assisted LLMs generated programs in response to math reasoning tasks.<n>Our findings show that the capabilities of models significantly impact the logic implemented to solve the problem.
arXiv Detail & Related papers (2025-04-24T15:34:24Z)
Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning [55.6623318085391]
Recent large language model (LLM) reasoning suffers from limited domain knowledge, susceptibility to hallucinations, and constrained reasoning depth. This paper presents the first investigation into integrating step-wise knowledge graph retrieval with step-wise reasoning. We propose KG-RAR, a framework centered on process-oriented knowledge graph construction, a hierarchical retrieval strategy, and a universal post-retrieval processing and reward model.
arXiv Detail & Related papers (2025-03-03T15:20:41Z)
A Survey on Large Language Models with some Insights on their Capabilities and Limitations [0.3222802562733786]
Large Language Models (LLMs) exhibit remarkable performance across various language-related tasks. LLMs have demonstrated emergent abilities extending beyond their core functions. This paper explores the foundational components, scaling mechanisms, and architectural strategies that drive these capabilities.
arXiv Detail & Related papers (2025-01-03T21:04:49Z)
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs. The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively. A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z)
Interpreting and Improving Large Language Models in Arithmetic Calculation [72.19753146621429]
Large language models (LLMs) have demonstrated remarkable potential across numerous applications. In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations. We investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs' computational performance.
arXiv Detail & Related papers (2024-09-03T07:01:46Z)
SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
Can LLM Graph Reasoning Generalize beyond Pattern Memorization? [46.93972334344908]
We evaluate whether large language models (LLMs) can go beyond semantic, numeric, structural, reasoning patterns in the synthetic training data and improve utility on real-world graph-based tasks. We find that while post-training alignment is most promising for real-world tasks, empowering LLM graph reasoning to go beyond pattern remains an open research question.
arXiv Detail & Related papers (2024-06-23T02:59:15Z)
Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models [47.129504708849446]
Large Language Models (LLMs) achieve impressive performance in a wide range of tasks. LLMs show emergent abilities in mathematical reasoning benchmarks. We evaluate three models of the Llama 2 family on different symbolic reasoning tasks.
arXiv Detail & Related papers (2024-06-05T12:22:43Z)
Puzzle Solving using Reasoning of Large Language Models: A Survey [1.9939549451457024]
This survey examines the capabilities of Large Language Models (LLMs) in puzzle solving. Our findings highlight the disparity between LLM capabilities and human-like reasoning. The survey underscores the necessity for novel strategies and richer datasets to advance LLMs' puzzle-solving proficiency.
arXiv Detail & Related papers (2024-02-17T14:19:38Z)
Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering [53.56653281752486]
This study explores Large Language Models' mathematical reasoning on four financial question-answering datasets. We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps. We introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance.
arXiv Detail & Related papers (2024-02-17T05:10:18Z)
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation. We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets. We show a significant performance drop across all the models against perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.