Related papers: U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

URL: http://arxiv.org/abs/2412.03205v3
Date: Tue, 14 Jan 2025 21:58:47 GMT
Title: U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
Authors: Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga,
Abstract summary: We introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials.<n>It is balanced across six core subjects, with 20% of multimodal problems.<n>Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions.<n>Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, with even lower 45% on visual problems.
Score: 2.2330469342127577
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of multimodal problems. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release $\mu$-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions. The evaluation of general domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, with even lower 45% on visual problems. The solution assessment proves challenging for LLMs, with the best LLM judge having an F1-score of 80% on $\mu$-MATH.

Related papers

LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models [13.713870642186254]
Large language models (LLMs) demonstrate remarkable capabilities across various tasks.<n>Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference.<n>We propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced.
arXiv Detail & Related papers (2025-07-30T03:50:46Z)
On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [54.965787768076254]
Large Language Models have been recently exploited as judges for complex natural language processing tasks, such as Q&A.<n>We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization.
arXiv Detail & Related papers (2025-07-22T13:40:26Z)
MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge [11.472720421988184]
We introduce MSQA, a comprehensive evaluation benchmark of 1,757 graduate-level materials science questions.<n>MSQA distinctively challenges large language models (LLMs) by requiring both precise factual knowledge and multi-step reasoning.
arXiv Detail & Related papers (2025-05-29T20:22:57Z)
Beyond Final Answers: Evaluating Large Language Models for Math Tutoring [0.24197860834245388]
We present two approaches to evaluate the correctness and quality of Large Language Models (LLMs) in math tutoring contexts. The first approach uses an intelligent tutoring system for college algebra as a testbed to assess LLM problem-solving capabilities. The second approach evaluates LLM as tutors rather than problem solvers.
arXiv Detail & Related papers (2025-02-23T15:43:45Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs. LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics [1.5716764919736026]
We introduce HARDMath, a dataset featuring challenging applied mathematics problems that require analytical approximation techniques. Our framework auto-generates a large number of problems with solutions validated against numerical ground truths. We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts.
arXiv Detail & Related papers (2024-10-13T20:09:41Z)
Evaluating LLMs with Multiple Problems at once [9.173325772800341]
This paper shows the benefits and fruitfulness of evaluating LLMs with multiple problems at once.<n>We introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts.<n>Our results show that LLMs are capable of handling multiple problems from a single data source as well as handling them separately, but there are conditions this multiple problem handling capability falls short.
arXiv Detail & Related papers (2024-06-16T02:52:32Z)
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading [100.02175403852253]
One common use of Large Language Models (LLMs) is performing tasks on scientific topics. Inspired by the way university students are evaluated on such tasks, we propose SciEx - a benchmark consisting of university computer science exam questions. We evaluate the performance of various state-of-the-art LLMs on our new benchmark.
arXiv Detail & Related papers (2024-06-14T21:52:21Z)
Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange [25.419977967846144]
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving.
arXiv Detail & Related papers (2024-03-30T12:48:31Z)
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z)
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents [38.51865513988743]
This paper introduces DocMath-Eval, a benchmark to evaluate the numerical reasoning capabilities of LLMs. We conduct an evaluation of 48 LLMs with Chain-of-Thought and Program-of-Thought prompting methods. We find that even the best-performing system (i.e., GPT-4o) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts.
arXiv Detail & Related papers (2023-11-16T11:30:53Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce an expansive benchmark suite SciBench for Large Language Model (LLM) SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)
CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? [15.53530547827583]
We present the Chinese Elementary School Math Word Problems dataset, comprising 1.7k elementary school-level math word problems with detailed annotations. This dataset aims to provide a benchmark tool for assessing the abilities of popular large language models (LLMs) We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $geq$ 60%) across all six elementary school grades.
arXiv Detail & Related papers (2023-06-29T02:19:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.