DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents
- URL: http://arxiv.org/abs/2311.09805v3
- Date: Fri, 9 Aug 2024 17:57:26 GMT
- Title: DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents
- Authors: Yilun Zhao, Yitao Long, Hongjun Liu, Ryo Kamoi, Linyong Nan, Lyuhao Chen, Yixin Liu, Xiangru Tang, Rui Zhang, Arman Cohan,
- Abstract summary: This paper introduces DocMath-Eval, a benchmark to evaluate the numerical reasoning capabilities of LLMs.
We conduct an evaluation of 48 LLMs with Chain-of-Thought and Program-of-Thought prompting methods.
We find that even the best-performing system (i.e., GPT-4o) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts.
- Score: 38.51865513988743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing specialized documents containing both text and tables. We conduct an extensive evaluation of 48 LLMs with Chain-of-Thought and Program-of-Thought prompting methods, aiming to comprehensively assess the capabilities and limitations of existing LLMs in DocMath-Eval. We found that even the current best-performing system (i.e., GPT-4o) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts. We believe that DocMath-Eval can serve as a valuable benchmark for evaluating LLMs' capabilities in solving challenging numerical reasoning problems within expert domains.
Related papers
- MathConstruct: Challenging LLM Reasoning with Constructive Proofs [0.9320657506524149]
mc is a new benchmark of 126 challenging problems sourced from various math competitions.
mc is suitable for Large Language Models evaluation, as solution correctness can be easily verified.
arXiv Detail & Related papers (2025-02-14T14:44:22Z) - U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs [2.2330469342127577]
We introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials.
It is balanced across six core subjects, with 20% of multimodal problems.
Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions.
Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, with even lower 45% on visual problems.
arXiv Detail & Related papers (2024-12-04T10:44:50Z) - MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs.
We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z) - MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark [29.9945601202065]
We propose MathScape, a new benchmark that emphasizes the understanding and application of combined visual and textual information.
MathScape is designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs.
We conduct a multi-dimensional evaluation on 11 advanced MLLMs, revealing that our benchmark is challenging even for the most sophisticated models.
arXiv Detail & Related papers (2024-08-14T13:23:43Z) - LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs [8.89259409245068]
Large language models (LLMs) demonstrate impressive capabilities in mathematical reasoning.
We present the Mathematical Topics Tree (MaTT) benchmark, a challenging and structured benchmark that offers 1,958 questions.
We find that the most advanced model, GPT-4, achieved a mere 54% accuracy in a multiple-choice scenario.
arXiv Detail & Related papers (2024-06-07T18:21:26Z) - MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z) - MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs.
We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources.
This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z) - Evaluating LLMs' Mathematical Reasoning in Financial Document Question
Answering [53.56653281752486]
This study explores Large Language Models' mathematical reasoning on four financial question-answering datasets.
We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps.
We introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance.
arXiv Detail & Related papers (2024-02-17T05:10:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.