STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2602.02497v1
- Date: Wed, 14 Jan 2026 07:17:12 GMT
- Title: STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models
- Authors: Xuzhao Li, Xuchen Li, Jian Zhao, Shiyu Hu
- Abstract summary: We propose a diagnostic framework designed to analyze the reasoning capabilities of Large Language Models (LLMs). We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified "Discipline $\times$ Cognition" capability space. Our empirical results reveal structural failure patterns in STEM reasoning.
- Score: 14.280808299733868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated "silos," offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified "Discipline $\times$ Cognition" capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.
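The abstract's core idea of replacing a monolithic aggregate score with per-cell accuracy over a "Discipline $\times$ Cognition" grid can be illustrated with a minimal sketch. The discipline names, cognition levels, and record fields below are illustrative assumptions, not the authors' actual taxonomy or data format.

```python
# Hypothetical sketch of a "Discipline x Cognition" capability space.
# Label names and the record schema are assumptions for illustration,
# not the taxonomy used in STEMVerse.
from collections import defaultdict

def aggregate(problems):
    """Group graded problems into (discipline, cognition) cells and
    report per-cell accuracy instead of one monolithic score."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [correct, total]
    for p in problems:
        key = (p["discipline"], p["cognition"])
        cells[key][1] += 1
        cells[key][0] += int(p["correct"])
    return {k: correct / total for k, (correct, total) in cells.items()}

sample = [
    {"discipline": "math", "cognition": "recall", "correct": True},
    {"discipline": "math", "cognition": "recall", "correct": False},
    {"discipline": "physics", "cognition": "analysis", "correct": True},
]
print(aggregate(sample))
# ("math", "recall") -> 0.5; ("physics", "analysis") -> 1.0
```

Such a per-cell view is what lets the framework distinguish errors caused by a weak discipline (a low row) from errors caused by a shallow cognitive level (a low column).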
Related papers
- LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigating model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z) - Interpretability Framework for LLMs in Undergraduate Calculus [0.0]
Large Language Models (LLMs) are increasingly being used in education, yet their correctness alone does not capture the quality, reliability, or pedagogical validity of their problem-solving behavior. We introduce a novel interpretability framework for analyzing LLM-generated solutions, using undergraduate calculus problems as a representative domain. Our approach combines reasoning flow extraction, which decomposes solutions into semantically labeled operations and concepts, with prompt ablation analysis to assess input salience and output stability.
arXiv Detail & Related papers (2025-10-19T17:20:36Z) - Fundamentals of Building Autonomous LLM Agents [64.39018305018904]
This paper reviews the architecture and implementation methods of agents powered by large language models (LLMs). The research aims to explore patterns for developing "agentic" LLMs that can automate complex tasks and bridge the performance gap with human capabilities.
arXiv Detail & Related papers (2025-10-10T10:32:39Z) - Generative Large Language Models for Knowledge Representation: A Systematic Review of Concept Map Generation [1.163826615891678]
The rise of generative large language models (LLMs) has opened new opportunities for automating knowledge representation through concept maps. This review systematically synthesizes the emerging body of research on LLM-enabled concept map generation. Findings reveal six major methodological categories: human-in-the-loop systems, weakly supervised learning models, fine-tuned domain-specific LLMs, pre-trained LLMs with prompt engineering, hybrid systems integrating knowledge bases, and modular frameworks combining symbolic and statistical tools.
arXiv Detail & Related papers (2025-09-18T02:36:54Z) - STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples [3.41981716024098]
Evaluating large language models (LLMs) has become increasingly challenging as model capabilities advance rapidly. We propose the Structured Transition Evaluation Method (STEM) as a lightweight and interpretable evaluation framework.
arXiv Detail & Related papers (2025-08-16T16:36:43Z) - Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes [84.1059652774853]
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. Recent studies, however, have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world.
arXiv Detail & Related papers (2025-04-21T11:48:39Z) - A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems [93.8285345915925]
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems. We categorize existing methods along two dimensions: (1) Regimes, which define the stage at which reasoning is achieved; and (2) Architectures, which determine the components involved in the reasoning process.
arXiv Detail & Related papers (2025-04-12T01:27:49Z) - Ontologies in Design: How Imagining a Tree Reveals Possibilities and Assumptions in Large Language Models [0.4563238570902448]
We argue that value-based analyses are crucial but under-recognized in analyzing these systems. Proposing a need for a practice-based engagement with pluralism, we offer four orientations for considering orientations in design.
arXiv Detail & Related papers (2025-04-03T21:04:36Z) - MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [5.02953506943752]
MM-IQ is a comprehensive evaluation framework comprising a large-scale training set with 4,776 visual reasoning problems and 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance. Inspired by the recent surge of large reasoning models, we also release a multimodal reasoning model as a baseline, trained via reinforcement learning with verifiable reward functions.
arXiv Detail & Related papers (2025-02-02T07:12:03Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited to system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.