Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models
- URL: http://arxiv.org/abs/2406.06588v1
- Date: Wed, 5 Jun 2024 12:22:43 GMT
- Title: Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models
- Authors: Flavio Petruzzellis, Alberto Testolin, Alessandro Sperduti,
- Abstract summary: Large Language Models (LLMs) achieve impressive performance in a wide range of tasks.
LLMs show emergent abilities in mathematical reasoning benchmarks.
We evaluate three models of the Llama 2 family on different symbolic reasoning tasks.
- Score: 47.129504708849446
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) achieve impressive performance in a wide range of tasks, even if they are often trained with the only objective of chatting fluently with users. Among other skills, LLMs show emergent abilities in mathematical reasoning benchmarks, which can be elicited with appropriate prompting methods. In this work, we systematically investigate the capabilities and limitations of popular open-source LLMs on different symbolic reasoning tasks. We evaluate three models of the Llama 2 family on two datasets that require solving mathematical formulas of varying degrees of difficulty. We test a generalist LLM (Llama 2 Chat) as well as two fine-tuned versions of Llama 2 (MAmmoTH and MetaMath) specifically designed to tackle mathematical problems. We observe that both increasing the scale of the model and fine-tuning it on relevant tasks lead to significant performance gains. Furthermore, using fine-grained evaluation measures, we find that such performance gains are mostly observed with mathematical formulas of low complexity, which nevertheless often remain challenging even for the largest fine-tuned models.
Related papers
- Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns? [57.80779199039929]
Large Language Models (LLMs) have demonstrated remarkable performance in solving math problems.
This paper introduces a novel benchmark, BeyondX, designed to address these limitations by incorporating problems with multiple unknowns.
Empirical study on BeyondX reveals that the performance of existing LLMs, even those fine-tuned specifically on math tasks, significantly decreases as the number of unknowns increases.
arXiv Detail & Related papers (2024-07-06T17:01:04Z) - MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [58.57255822646756]
This paper introduces MathChat, a benchmark designed to evaluate large language models (LLMs) across a broader spectrum of mathematical tasks.
We evaluate the performance of various SOTA LLMs on the MathChat benchmark, and we observe that while these models excel in single turn question answering, they significantly underperform in more complex scenarios.
We develop MathChat sync, a synthetic dialogue based math dataset for LLM finetuning, focusing on improving models' interaction and instruction following capabilities in conversations.
arXiv Detail & Related papers (2024-05-29T18:45:55Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Investigating Symbolic Capabilities of Large Language Models [16.88906206735967]
This study aims to bridge the gap by rigorously evaluating Large Language Models (LLMs) on a series of symbolic tasks.
Our analysis encompasses eight LLMs, including four enterprise-grade and four open-source models, of which three have been pre-trained on mathematical tasks.
The findings reveal a significant decline in LLMs' performance on context-free and context-sensitive symbolic tasks as the complexity, represented by the number of symbols, increases.
arXiv Detail & Related papers (2024-05-21T21:24:34Z) - Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection [2.2724928083094196]
This work looks at the performance of a range of LLMs on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE.
We find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales.
arXiv Detail & Related papers (2024-05-15T11:55:14Z) - From Large to Tiny: Distilling and Refining Mathematical Expertise for Math Word Problems with Weakly Supervision [12.023661884821554]
We introduce an innovative two-stage framework that adeptly transfers mathematical Expertise from large to tiny language models.
Our method fully leverages the semantic understanding capabilities during the searching 'problem-equation' pair.
It demonstrates significantly improved performance on the Math23K and Weak12K datasets compared to existing small model methods.
arXiv Detail & Related papers (2024-03-21T13:29:54Z) - Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
Large Language Models (LLMs) have showcased striking results on logical reasoning benchmarks.
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We show a significant performance drop across all the models against the questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z) - Exploring the Potential of Large Language Models in Computational Argumentation [54.85665903448207]
Large language models (LLMs) have demonstrated impressive capabilities in understanding context and generating natural language.
This work aims to embark on an assessment of LLMs, such as ChatGPT, Flan models, and LLaMA2 models, in both zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-11-15T15:12:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.