Related papers: LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models

URL: http://arxiv.org/abs/2401.00757v3
Date: Tue, 08 Oct 2024 14:34:37 GMT
Title: LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models
Authors: Yuxuan Wan, Wenxuan Wang, Yiliu Yang, Youliang Yuan, Jen-tse Huang, Pinjia He, Wenxiang Jiao, Michael R. Lyu
Abstract summary: We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs). Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
Score: 63.14196038655506
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) such as ChatGPT and GPT-4. Despite LLMs' prowess in tasks like writing assistance, code generation, and machine translation, assessing their ability to reason has been challenging. Traditional evaluations often prioritize accuracy on downstream tasks over direct assessments of reasoning processes. LogicAsker addresses this gap by employing a set of atomic reasoning skills grounded in propositional and predicate logic to systematically examine and improve the reasoning prowess of LLMs. Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. Moreover, we leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%. To our knowledge, this is the first effort to utilize test case outcomes to effectively refine LLMs' formal reasoning capabilities. We make our code, data, and results publicly available (https://github.com/yxwan123/LogicAsker) to facilitate further research and replication of our findings.

Related papers

Training LLMs with LogicReward for Faithful and Rigorous Reasoning [75.30425553246177]
We propose LogicReward, a reward system that guides model training by enforcing step-level logical correctness with a theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks.
arXiv Detail & Related papers (2025-12-20T03:43:02Z)
Teaching Small Language Models to Learn Logic through Meta-Learning [4.923078123348596]
Small models (1.5B-7B) fine-tuned with meta-learning demonstrate strong gains in generalization. These meta-learned models outperform GPT-4o and o3-mini on our syllogistic reasoning task.
arXiv Detail & Related papers (2025-05-20T13:00:48Z)
JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models. JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures. Our experimental results reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average.
arXiv Detail & Related papers (2025-01-24T15:49:10Z)
Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models [9.689096888732642]
We propose Logic-of-Thought (LoT) prompting which employs propositional logic to generate expanded logical information descriptions. LoT boosts the performance of various prompting methods with a striking margin across five logical reasoning tasks.
arXiv Detail & Related papers (2024-09-26T04:59:45Z)
Lost in the Logic: An Evaluation of Large Language Models' Reasoning Capabilities on LSAT Logic Games [0.0]
I evaluate the performance of Large Language Models (LLMs) on the Law School Admissions Test (LSAT). I construct a dataset of logic games and their associated metadata, and extensively evaluate LLMs' performance in a Chain-of-Thought prompting setting. I analyze the types of logic games that models perform better or worse on, as well as the types of logical errors I observe from human annotation.
arXiv Detail & Related papers (2024-09-23T21:37:40Z)
Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents [3.5083201638203154]
Logic-Enhanced Language Model Agents (LELMA) is a framework that integrates large language models with formal logic. LELMA employs autoformalization to translate reasoning into logic representations, which are then used to assess logical validity. LELMA achieves high accuracy in error detection and improves reasoning correctness via self-refinement.
arXiv Detail & Related papers (2024-08-28T18:25:35Z)
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z)
Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding [40.2816930342597]
Large Language Models (LLMs) have demonstrated good performance in many reasoning tasks. But they still struggle with some complicated reasoning tasks including logical reasoning. We propose five concrete tasks from three cognitive dimensions of WHAT, WHY, and HOW in this paper.
arXiv Detail & Related papers (2024-04-04T08:38:03Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems. LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning. We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning techniques to assess model performance.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training. We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter size ranging from 3 billion to 13 billion. The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.