TheoremQA: A Theorem-driven Question Answering dataset
- URL: http://arxiv.org/abs/2305.12524v3
- Date: Wed, 6 Dec 2023 03:02:45 GMT
- Title: TheoremQA: A Theorem-driven Question Answering dataset
- Authors: Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu
Xu, Xinyi Wang, Tony Xia
- Abstract summary: GPT-4's capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts Prompting.
TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems.
- Score: 100.39878559382694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in
solving fundamental math problems like GSM8K by achieving over 90% accuracy.
However, their capabilities to solve more challenging math problems that
require domain-specific knowledge (i.e., theorems) have yet to be investigated.
In this paper, we introduce TheoremQA, the first theorem-driven
question-answering dataset designed to evaluate AI models' capabilities to
apply theorems to solve challenging science problems. TheoremQA is curated by
domain experts and contains 800 high-quality questions covering 350 theorems
(e.g. Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem,
Elasticity Theorem, etc) from Math, Physics, EE&CS, and Finance. We evaluate a
wide spectrum of 16 large language and code models with different prompting
strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that
GPT-4's capabilities to solve these problems are unparalleled, achieving an
accuracy of 51% with Program-of-Thoughts prompting. All existing
open-source models score below 15%, barely surpassing the random-guess baseline.
Given the diversity and broad coverage of TheoremQA, we believe it can be used
as a better benchmark to evaluate LLMs' capabilities to solve challenging
science problems. The data and code are released in
https://github.com/wenhuchen/TheoremQA.
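As a rough illustration of how Program-of-Thoughts prompting works on a TheoremQA-style question, the minimal sketch below asks a model to emit a short Python program whose execution produces the numerical answer, then runs that program to recover the result. The prompt wording, the helper names (`build_pot_prompt`, `execute_program`), and the example question are hypothetical; the exact prompts and evaluation harness are in the official repository linked above.

```python
# Illustrative sketch of Program-of-Thoughts (PoT) prompting on a
# TheoremQA-style question: instead of answering in free-form text, the
# model writes a short Python program whose execution yields the answer.
# The prompt wording and the example question are hypothetical; the exact
# prompts used in the paper live in the official repository.

POT_PROMPT = """You will be given a science question that requires applying a theorem.
Write a Python program that computes the answer and stores it in a variable named `ans`.

Question: {question}

Python program:
"""


def build_pot_prompt(question: str) -> str:
    """Fill the PoT template with a concrete question."""
    return POT_PROMPT.format(question=question)


def execute_program(program: str) -> float:
    """Run a model-generated program and read back `ans`.

    A real evaluation harness would sandbox this call and enforce timeouts;
    exec() is used here purely for illustration.
    """
    namespace: dict = {}
    exec(program, namespace)
    return namespace["ans"]


if __name__ == "__main__":
    question = ("Use Taylor's theorem to approximate sin(0.1) with a "
                "third-order polynomial around 0.")
    prompt = build_pot_prompt(question)  # this string would be sent to the LLM
    # A plausible model completion for this question:
    completion = "x = 0.1\nans = x - x**3 / 6\n"
    print(round(execute_program(completion), 6))  # -> 0.099833
```

Executing the generated program, rather than parsing a free-form answer, is what lets PoT-style evaluation score numerical answers exactly.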
Related papers
- MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z)
- ATG: Benchmarking Automated Theorem Generation for Generative Language Models [83.93978859348313]
Humans can develop new theorems to explore broader and more complex mathematical results.
Current generative language models (LMs) have achieved significant improvement in automatically proving theorems.
This paper proposes an Automated Theorem Generation benchmark that evaluates whether an agent can automatically generate valuable (and possibly brand new) theorems.
arXiv Detail & Related papers (2024-05-05T02:06:37Z)
- Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange [25.419977967846144]
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks.
This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving.
arXiv Detail & Related papers (2024-03-30T12:48:31Z)
- MathScale: Scaling Instruction Tuning for Mathematical Reasoning [70.89605383298331]
Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving.
However, their proficiency in solving mathematical problems remains inadequate.
We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data.
arXiv Detail & Related papers (2024-03-05T11:42:59Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed issue is that LLMs can behave incorrectly when the math questions are slightly changed.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- REFACTOR: Learning to Extract Theorems from Proofs [29.44286369265644]
We show that REFACTOR can extract 19.6% of the theorems that humans would use to write the proofs.
With the newly extracted theorems, we show that the existing MetaMath database can be refactored.
We also demonstrate that a prover trained on the refactored dataset proves more test theorems and outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-26T21:21:30Z)
- Learning to Prove Theorems by Learning to Generate Theorems [71.46963489866596]
We learn a neural generator that automatically synthesizes theorems and proofs for the purpose of training a theorem prover.
Experiments on real-world tasks demonstrate that synthetic data from our approach improves the theorem prover.
arXiv Detail & Related papers (2020-02-17T16:06:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.