SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving
- URL: http://arxiv.org/abs/2505.16646v4
- Date: Mon, 13 Oct 2025 07:00:07 GMT
- Title: SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving
- Authors: Yujie Hou, Ting Zhang, Mei Wang, Xuetao Ma, Hua Huang
- Abstract summary: Large Language Models (LLMs) have achieved remarkable results on a variety of mathematical benchmarks. Common evaluation methods, which focus on either the final answer or the reasoning process, fail to assess the entire problem-solving procedure. Our findings reveal genuine weaknesses in current LLMs and motivate a new metric, the All-Pass Score, to better capture true problem-solving capabilities.
- Score: 24.689620248781214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine reasoning or superficial pattern recognition. Common evaluation methods, which focus on either the final answer or the reasoning process, fail to assess the entire problem-solving procedure. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework, together with its corresponding benchmark, SMART-Bench. SMART decomposes the entire problem-solving process into four distinct cognitive dimensions: Understanding, Reasoning, Arithmetic, and Reflection & Refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings reveal genuine weaknesses in current LLMs and motivate a new metric, the All-Pass Score, to better capture true problem-solving capabilities. Code and benchmarks will be released upon acceptance.
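The abstract does not spell out how the All-Pass Score is computed, but the name and the four-dimension decomposition suggest a metric that credits a model on a problem only when it passes every dimension. The sketch below is a minimal illustration under that assumption; the function and field names are hypothetical and not taken from the SMART codebase.

```python
from typing import Dict, List

# Hypothetical per-problem record: pass/fail for each SMART dimension.
DIMENSIONS = ["understanding", "reasoning", "arithmetic", "reflection_refinement"]

def all_pass_score(results: List[Dict[str, bool]]) -> float:
    """Fraction of problems where the model passes *every* dimension.

    Assumed reading of the All-Pass Score: per-dimension accuracies can
    look high individually while their conjunction remains low.
    """
    if not results:
        return 0.0
    all_pass = sum(1 for r in results if all(r[d] for d in DIMENSIONS))
    return all_pass / len(results)

# Toy example: good per-dimension accuracy, weaker all-pass score.
example = [
    {"understanding": True, "reasoning": True, "arithmetic": True, "reflection_refinement": True},
    {"understanding": True, "reasoning": False, "arithmetic": True, "reflection_refinement": True},
    {"understanding": True, "reasoning": True, "arithmetic": False, "reflection_refinement": True},
]
print(all_pass_score(example))  # 0.333...
```

Under this reading, a model that scores 80-90% on each dimension in isolation can still post a much lower All-Pass Score, which is the gap the metric is meant to expose.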
Related papers
- Benchmarking at the Edge of Comprehension [38.43582342860192]
If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake.
We propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible.
Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims.
arXiv Detail & Related papers (2026-02-15T20:51:29Z) - MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models [15.929002709503921]
We aim to evaluate a fundamental yet underexplored intelligence: association.
MM-OPERA is a systematic benchmark with 11,497 instances across two open-ended tasks.
It challenges LVLMs to resemble the spirit of divergent thinking and convergent associative reasoning.
arXiv Detail & Related papers (2025-10-30T18:49:06Z) - Multi-Agent Evolve: LLM Self-Improve through Co-evolution [53.00458074754831]
Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs).
Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data.
We propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A.
arXiv Detail & Related papers (2025-10-27T17:58:02Z) - LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models [13.713870642186254]
Large language models (LLMs) demonstrate remarkable capabilities across various tasks.
Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference.
We propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced.
arXiv Detail & Related papers (2025-07-30T03:50:46Z) - INTEGRALBENCH: Benchmarking LLMs with Definite Integral Problems [5.177249919642388]
INTEGRALBENCH is a focused benchmark designed to evaluate Large Language Model (LLM) performance on definite integral problems.
Our evaluation of nine state-of-the-art LLMs reveals significant performance gaps and strong correlations between problem difficulty and model accuracy.
arXiv Detail & Related papers (2025-07-22T08:44:36Z) - ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models [70.33764118171463]
Large Language Models (LLMs) tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability.
We develop the ReliableMath dataset, which incorporates open-source solvable problems and high-quality unsolvable problems.
LLMs fail to directly identify unsolvable problems and always generate fabricated responses.
arXiv Detail & Related papers (2025-07-03T19:19:44Z) - Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge [0.0]
Existing studies treat memorization and self-knowledge deficits in LLMs as separate issues.
We utilize a novel framework to ascertain if LLMs genuinely learn reasoning patterns from training data.
Our analysis shows a noteworthy problem in generalization: LLMs draw confidence from memorized solutions to infer a higher self-knowledge.
arXiv Detail & Related papers (2025-06-23T18:01:16Z) - Solving Inequality Proofs with Large Language Models [46.71658812761115]
Inequality proving is crucial across diverse scientific and mathematical fields.
This makes it a demanding frontier for large language models (LLMs).
We release IneqMath, an expert-curated dataset of Olympiad-level inequalities.
arXiv Detail & Related papers (2025-06-09T16:43:38Z) - Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [106.17986469245302]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking.
Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability.
We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z) - Self-Memory Alignment: Mitigating Factual Hallucinations with Generalized Improvement [37.59724553583446]
Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in factual hallucinations.
We introduce self-memory alignment (SMA), which fine-tunes the model on self-generated responses to precise and simple factual questions.
Extensive experiments show that SMA significantly improves LLMs' overall performance, with consistent enhancement across various benchmarks concerning factuality, as well as helpfulness and comprehensive skills.
arXiv Detail & Related papers (2025-02-26T13:34:52Z) - Emergence of Self-Identity in AI: A Mathematical Framework and Empirical Study with Generative Large Language Models [4.036530158875673]
This paper introduces a mathematical framework for defining and quantifying self-identity in AI systems.
Our framework posits that self-identity emerges from two mathematically quantifiable conditions.
The implications of our study are immediately relevant to the fields of humanoid robotics and autonomous systems.
arXiv Detail & Related papers (2024-11-27T17:23:47Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain, as the best-performing model, GPT-4o, still trails human evaluation by around 10%.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - MalAlgoQA: Pedagogical Evaluation of Counterfactual Reasoning in Large Language Models and Implications for AI in Education [2.872215065231376]
This paper introduces MalAlgoQA, a dataset designed to evaluate the counterfactual reasoning capabilities of Large Language Models.
At the heart of MalAlgoQA are "malgorithms": rationales behind incorrect answer choices that represent flawed yet logically coherent reasoning paths.
arXiv Detail & Related papers (2024-07-01T03:39:13Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.
It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval consistently outperforms baseline methods on the meta-evaluation datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)