NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language
Models via Complexity Classes
- URL: http://arxiv.org/abs/2312.14890v4
- Date: Mon, 12 Feb 2024 17:30:25 GMT
- Title: NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language
Models via Complexity Classes
- Authors: Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling, Yongfeng Zhang
- Abstract summary: NPHardEval is designed to evaluate the reasoning abilities of Large Language Models (LLMs) across a broad spectrum of 900 questions.
The questions are meticulously chosen to represent a wide range of complexity classes below the NP-hard class.
The benchmark is designed with a dynamic update mechanism in which the datapoints are refreshed on a monthly basis.
- Score: 32.154637177467684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Complex reasoning ability is one of the most important features of current
LLMs and plays an integral role in complex decision-making tasks. Therefore, the investigation into the reasoning
capabilities of Large Language Models (LLMs) is critical: numerous benchmarks
have been established to assess the reasoning abilities of LLMs. However,
current benchmarks are inadequate for rigorously evaluating the full extent of
the reasoning abilities that LLMs can achieve. They are also
prone to the risk of overfitting, as these benchmarks, being publicly
accessible and static, allow models to potentially tailor their responses to
specific benchmark metrics, thereby inflating their performance. Addressing
these limitations, our research introduces a new benchmark, named NPHardEval.
This benchmark is designed to evaluate the reasoning abilities of LLMs across a
broad spectrum of 900 algorithmic questions, extending up to the NP-Hard
complexity class. These questions are meticulously chosen to represent a wide
range of complexity classes below the NP-hard class, offering a
rigorous measure of the reasoning ability of LLMs. Through this study, we shed
light on the current state of reasoning in LLMs, providing an objective and
rigorous perspective through the comparison of LLMs' performance across
complexity classes. Moreover, this benchmark is designed with a dynamic update mechanism,
where the datapoints are refreshed on a monthly basis. Such regular updates
play a crucial role in mitigating the risk of LLMs overfitting to the
benchmark, promoting a more accurate and reliable assessment of their reasoning
capabilities. The benchmark dataset and code of NPHardEval are available at
https://github.com/casmlab/NPHardEval.
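To make the dynamic-update idea concrete, below is a minimal, hypothetical sketch of how monthly-refreshed question instances at two ends of the complexity spectrum might be generated. The names (monthly_seed, generate_sorting_instance, generate_knapsack_instance) are illustrative assumptions, not the actual NPHardEval API; the real generators live in the linked repository.

```python
import datetime
import random

# Illustrative sketch only: not the NPHardEval codebase.
# Idea: derive a seed from the current month so every datapoint is
# regenerated once per month, mitigating benchmark overfitting.

def monthly_seed(today: datetime.date | None = None) -> int:
    """Deterministic seed that changes once per calendar month."""
    today = today or datetime.date.today()
    return today.year * 100 + today.month

def generate_sorting_instance(rng: random.Random, n: int = 10) -> list[int]:
    """Fresh sorting instance: a polynomial-time (P) task."""
    return [rng.randint(0, 1000) for _ in range(n)]

def generate_knapsack_instance(rng: random.Random, n_items: int = 10) -> dict:
    """Fresh 0/1 Knapsack instance: an NP-hard task."""
    weights = [rng.randint(1, 50) for _ in range(n_items)]
    values = [rng.randint(1, 100) for _ in range(n_items)]
    capacity = sum(weights) // 2
    return {"weights": weights, "values": values, "capacity": capacity}

if __name__ == "__main__":
    rng = random.Random(monthly_seed())
    print(generate_sorting_instance(rng))   # easy end of the spectrum
    print(generate_knapsack_instance(rng))  # NP-hard end of the spectrum
```

Because the seed is fixed within a month, all models see the same refreshed instances in a given evaluation round, while the instances themselves change from month to month.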
Related papers
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning [20.066249913943405]
Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors.
We introduce novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios.
Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks.
arXiv Detail & Related papers (2024-06-13T14:31:19Z) - TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability [0.0]
We present a curated collection of challenging statements on sensitive topics for benchmarking called TruthEval.
These statements were curated by hand and contain known truth values.
We perform some initial analyses using this dataset and find several instances of LLMs failing at simple tasks, showing their inability to understand simple questions.
arXiv Detail & Related papers (2024-06-04T00:01:35Z) - LLMs for Relational Reasoning: How Far are We? [8.840750655261251]
Large language models (LLMs) have revolutionized many areas by achieving state-of-the-art performance on downstream tasks.
Recent efforts have demonstrated that LLMs are poor at solving sequential decision-making problems.
arXiv Detail & Related papers (2024-01-17T08:22:52Z) - InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal
Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z) - Flames: Benchmarking Value Alignment of LLMs in Chinese [86.73527292670308]
This paper proposes a value alignment benchmark named Flames.
It encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values.
Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames.
arXiv Detail & Related papers (2023-11-12T17:18:21Z) - MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - Through the Lens of Core Competency: Survey on Evaluation of Large
Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding ability, and new tasks can easily be added to the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.