GLoRE: Evaluating Logical Reasoning of Large Language Models
- URL: http://arxiv.org/abs/2310.09107v1
- Date: Fri, 13 Oct 2023 13:52:15 GMT
- Title: GLoRE: Evaluating Logical Reasoning of Large Language Models
- Authors: Hanmeng Liu, Zhiyang Teng, Ruoxi Ning, Jian Liu, Qiji Zhou, Yue Zhang
- Abstract summary: We introduce GLoRE, a benchmark comprising 12 datasets that span three types of tasks.
ChatGPT and GPT-4 show strong logical reasoning capability, with GPT-4 surpassing ChatGPT by a large margin.
We propose a self-consistency probing method to enhance the accuracy of ChatGPT and a fine-tuning method to boost the performance of an open LLM.
- Score: 29.914546407784552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, large language models (LLMs), including notable models such as
GPT-4 and burgeoning community models, have showcased significant general
language understanding abilities. However, there has been a scarcity of
attempts to assess the logical reasoning capacities of these LLMs, an essential
facet of natural language understanding. To encourage further investigation in
this area, we introduce GLoRE, a meticulously assembled General Logical
Reasoning Evaluation benchmark comprising 12 datasets that span three types of
tasks. Our experimental results show that, compared to human performance and
supervised fine-tuning, the logical reasoning capabilities of open LLMs require
further improvement; ChatGPT and GPT-4 show strong logical reasoning
capability, with GPT-4 surpassing ChatGPT by a large margin. We propose a
self-consistency probing method to enhance the accuracy of ChatGPT and a
fine-tuning method to boost the performance of an open LLM. We release the
datasets and evaluation programs to facilitate future research.
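The abstract names self-consistency probing without detailing it; a common reading of self-consistency is majority voting over several sampled answers. A minimal sketch under that assumption, where query_model is a placeholder for any chat-completion call, not an API from the paper:

```python
from collections import Counter
from typing import Callable, List

def self_consistency_answer(
    query_model: Callable[[str], str],  # placeholder: returns one sampled completion
    question: str,
    num_samples: int = 5,
) -> str:
    """Sample several answers at nonzero temperature and return the majority vote."""
    answers: List[str] = [query_model(question).strip() for _ in range(num_samples)]
    # Agreement across samples is the 'consistency' signal; voting filters
    # out answers the model produces only sporadically.
    winner, _ = Counter(answers).most_common(1)[0]
    return winner
```

For multiple-choice logical reasoning tasks like those in GLoRE, the vote would be taken over the predicted option labels rather than over free-form text.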
Related papers
- Self-Discover: Large Language Models Self-Compose Reasoning Structures [136.48389510481758]
We introduce SELF-DISCOVER, a framework for self-discovering task-intrinsic reasoning structures.
SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks.
We show that the self-discovered reasoning structures are universally applicable across model families.
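The summary omits the mechanics; SELF-DISCOVER is described as composing a reasoning structure via three meta-prompting stages (SELECT, ADAPT, IMPLEMENT) over a pool of atomic reasoning modules. A hedged sketch of that pipeline, where llm is a placeholder call and the prompt wordings are illustrative rather than the paper's exact prompts:

```python
from typing import Callable, List

def self_discover_structure(
    llm: Callable[[str], str],     # placeholder chat-completion call
    task_examples: str,
    reasoning_modules: List[str],  # e.g. "break into sub-problems", "think step by step"
) -> str:
    """Compose a task-specific reasoning structure in three meta-prompting stages."""
    modules = "\n".join(f"- {m}" for m in reasoning_modules)
    # SELECT: pick the modules relevant to this task.
    selected = llm(
        f"Task examples:\n{task_examples}\n"
        f"Select the reasoning modules useful for solving them:\n{modules}"
    )
    # ADAPT: rephrase the chosen modules to fit the task.
    adapted = llm(f"Adapt these reasoning modules to the task above:\n{selected}")
    # IMPLEMENT: turn them into an explicit step-by-step plan to fill in at solve time.
    return llm(f"Turn the adapted modules into a step-by-step reasoning structure:\n{adapted}")
```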
arXiv Detail & Related papers (2024-02-06T01:13:53Z)
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tuning data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
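The summary mentions testing logical rules but shows no example probe; a minimal sketch of atomic-rule test generation in that spirit, assuming a modus ponens template with nonce words so the model cannot fall back on world knowledge (the vocabulary and wording are illustrative):

```python
import random

# Nonce subjects and properties keep the probe about the rule itself,
# not about memorized world knowledge.
SUBJECTS = ["a wug", "a blicket", "a dax"]
PROPERTIES = [("red", "heavy"), ("smooth", "cold"), ("tall", "fast")]

def modus_ponens_probe() -> tuple:
    """Return (question, expected_answer) testing 'if P then Q; P; therefore Q'."""
    subject = random.choice(SUBJECTS)
    p, q = random.choice(PROPERTIES)
    question = (
        f"If {subject} is {p}, then it is {q}. "
        f"Suppose {subject} is {p}. Is it {q}? Answer yes or no."
    )
    return question, "yes"
```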
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- GPT-4 Surpassing Human Performance in Linguistic Pragmatics [0.0]
This study investigates the ability of Large Language Models (LLMs) to comprehend and interpret linguistic pragmatics.
Using Grice's communication principles, LLMs and human subjects were evaluated based on their responses to various dialogue-based tasks.
The findings revealed the superior performance and speed of LLMs, particularly GPT-4, over human subjects in interpreting pragmatics.
arXiv Detail & Related papers (2023-12-15T05:40:15Z)
- A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4 [4.206175795966694]
Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation.
We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT-4, as GPT-3 family large language models (GLLMs).
arXiv Detail & Related papers (2023-10-04T16:37:05Z)
- Metacognitive Prompting Improves Understanding in Large Language Models [12.112914393948415]
We introduce Metacognitive Prompting (MP), a strategy inspired by human introspective reasoning processes.
We conduct experiments on four prevalent Large Language Models (LLMs) across ten natural language understanding (NLU) datasets.
MP consistently outperforms existing prompting methods in both general and domain-specific NLU tasks.
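MP's staged, introspective prompt is easy to picture from this description; a minimal sketch assuming the commonly cited stages (comprehension, preliminary judgment, critical evaluation, final decision, confidence rating), where the exact wording is an assumption rather than the paper's template:

```python
METACOGNITIVE_TEMPLATE = """\
{task_input}

Answer by working through the following introspective stages:
1. Comprehension: restate what the input is asking in your own words.
2. Preliminary judgment: give an initial answer.
3. Critical evaluation: check that judgment for errors or overlooked evidence.
4. Final decision: state your answer, revised if needed.
5. Confidence: rate your confidence in the final answer (low/medium/high)."""

def metacognitive_prompt(task_input: str) -> str:
    """Wrap an NLU input in the staged self-reflection scaffold."""
    return METACOGNITIVE_TEMPLATE.format(task_input=task_input)
```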
arXiv Detail & Related papers (2023-08-10T05:10:17Z)
- Pushing the Limits of ChatGPT on NLP Tasks [79.17291002710517]
Despite the success of ChatGPT, its performance on most NLP tasks remains well below supervised baselines.
In this work, we investigate the causes and find that the subpar performance stems from several identifiable factors.
We propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks.
arXiv Detail & Related papers (2023-06-16T09:40:05Z)
- Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance [35.38549845444575]
This work proposes Chain-of-Thought Hub, an open-source evaluation suite on the multi-step reasoning capabilities of large language models.
arXiv Detail & Related papers (2023-05-26T23:46:42Z)
- Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training.
We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter sizes ranging from 3 billion to 13 billion.
The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [56.104476412839944]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
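The abstract does not show the prompt format; a common reading of LLM re-ranking in this line of work is permutation generation: number the candidate passages, ask the model for an ordering, and parse it. A sketch under that assumption, where llm is a placeholder call and the prompt wording is illustrative:

```python
import re
from typing import Callable, List

def rerank(llm: Callable[[str], str], query: str, passages: List[str]) -> List[str]:
    """Re-rank passages by asking the model for a permutation such as '[2] > [1] > [3]'."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    reply = llm(
        f"Query: {query}\nPassages:\n{numbered}\n"
        f"Rank the passages by relevance to the query, most relevant first, "
        f"answering only in the format [2] > [1] > [3]."
    )
    # Parse the permutation, dropping out-of-range or repeated indices.
    order: List[int] = []
    for match in re.findall(r"\[(\d+)\]", reply):
        i = int(match) - 1
        if 0 <= i < len(passages) and i not in order:
            order.append(i)
    # Append anything the model omitted, preserving the original order.
    order += [i for i in range(len(passages)) if i not in order]
    return [passages[i] for i in order]
```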
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.