Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions
- URL: http://arxiv.org/abs/2401.09395v4
- Date: Thu, 27 Jun 2024 05:23:50 GMT
- Title: Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions
- Authors: Pengfei Hong, Navonil Majumder, Deepanway Ghosal, Somak Aditya, Rada Mihalcea, Soujanya Poria,
- Abstract summary: Large Language Models (LLMs) have showcased striking results on logical reasoning benchmarks.
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We show a significant performance drop across all the models against the questions.
- Score: 47.83142414018448
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness in reasoning tasks remains an open question. To this end, in this paper, we focus on two popular reasoning tasks: arithmetic reasoning and code generation. Particularly, we introduce: (i) a general ontology of perturbations for maths and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets, MORE and CORE, respectively, of perturbed maths and coding problems to probe the limits of LLM capabilities in numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and open-source LLMs, we show a significant performance drop across all the models against the perturbed questions, suggesting that the current LLMs lack robust problem solving skills and structured reasoning abilities in many areas, as defined by our ontology. We open source the datasets and source codes at: https://github.com/declare-lab/llm_robustness.
Related papers
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data [20.31528845718877]
Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities.
This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset.
arXiv Detail & Related papers (2024-06-26T13:02:35Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models [47.129504708849446]
Large Language Models (LLMs) achieve impressive performance in a wide range of tasks.
LLMs show emergent abilities in mathematical reasoning benchmarks.
We evaluate three models of the Llama 2 family on different symbolic reasoning tasks.
arXiv Detail & Related papers (2024-06-05T12:22:43Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by
Dissociating Language and Cognition [57.747888532651]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Learning To Teach Large Language Models Logical Reasoning [33.88499005859982]
Large language models (LLMs) have gained enormous attention from both academia and industry.
However, current LLMs still output unreliable content in practical reasoning tasks due to their inherent issues.
arXiv Detail & Related papers (2023-10-13T14:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.