Related papers: Scalable Generation and Validation of Isomorphic Physics Problems with GenAI

URL: http://arxiv.org/abs/2602.05114v1
Date: Wed, 04 Feb 2026 23:01:20 GMT
Title: Scalable Generation and Validation of Isomorphic Physics Problems with GenAI
Authors: Naiming Liu, Leo Murch, Spencer Moore, Tong Wan, Shashank Sonkar, Richard Baraniuk, Zhongzhou Chen
Abstract summary: We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs) and compare against actual student performance.
Score: 2.249733437447874
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Traditional synchronous STEM assessments face growing challenges including accessibility barriers, security concerns from resource-sharing platforms, and limited comparability across institutions. We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations (numeric values, spatial relations) alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs) (0.6B-32B) and compare against actual student performance (N>200) across three midterm exams. Results show that 73% of deployed banks achieve statistically homogeneous difficulty, and LM performance patterns correlate strongly with student performance (Pearson's $ρ$ up to 0.594). Additionally, LMs successfully identify problematic variants, such as ambiguous problem texts. Model scale also proves critical for effective validation: extremely small (<4B) and large (>14B) models exhibit floor and ceiling effects, respectively, making mid-sized models optimal for detecting difficulty outliers.
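Illustrative sketch: the validation idea in the abstract (comparing per-variant LM accuracy with student accuracy, and spotting difficulty outliers within a bank) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' pipeline: the accuracy values are invented for illustration, and `flag_outliers` with its `tol` threshold is a hypothetical stand-in for whatever statistical homogeneity test the paper actually uses.

```python
from math import sqrt
from statistics import median


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def flag_outliers(accuracies, tol=0.2):
    """Return indices of variants whose accuracy is more than `tol` from the bank median.

    Assumption: a fixed tolerance around the median is used here only for
    illustration; it is not the test described in the paper.
    """
    m = median(accuracies)
    return [i for i, a in enumerate(accuracies) if abs(a - m) > tol]


# Hypothetical per-variant accuracies (fraction correct) for one problem bank.
lm_accuracy = [0.82, 0.78, 0.41, 0.80, 0.75]
student_accuracy = [0.71, 0.69, 0.48, 0.74, 0.66]

r = pearson(lm_accuracy, student_accuracy)
suspect_variants = flag_outliers(lm_accuracy)
print(f"Pearson correlation: {r:.3f}")
print(f"Variants to review: {suspect_variants}")
```

In this toy data the third variant is much harder for the LMs than its siblings, so it is the one flagged for manual review before deployment.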

Related papers

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study the gap through contextual mathematical reasoning. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings. Open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20.
arXiv Detail & Related papers (2026-01-30T14:56:04Z)
QueST: Incentivizing LLMs to Generate Difficult Problems [77.75835742350644]
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. Existing competitive coding datasets contain only thousands to tens of thousands of problems. We propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning.
arXiv Detail & Related papers (2025-10-20T16:29:53Z)
UniCode: A Framework for Generating High Quality Competitive Coding Problems [41.66698149759178]
UniCode is a novel framework that automatically generates high-quality algorithmic problems alongside robust, contamination-resistant test cases. We show that UniCode is highly challenging and discriminative, with the top-performing model, o4-mini, achieving a pass rate of only 70.3%.
arXiv Detail & Related papers (2025-10-16T05:07:12Z)
Harnessing Consistency for Robust Test-Time LLM Ensemble [88.55393815158608]
CoRE is a plug-and-play technique that harnesses model consistency for robust LLM ensemble. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens. Model-level consistency models global agreement by promoting model outputs with high self-confidence.
arXiv Detail & Related papers (2025-10-12T04:18:45Z)
MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning [0.0]
Large language models excel on math benchmarks, but their math reasoning robustness to linguistic variation is underexplored. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments. Our results highlight that robustness to linguistic variation is a fundamental challenge, exposing reasoning vulnerabilities in models.
arXiv Detail & Related papers (2025-10-07T20:09:29Z)
ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning [51.946959481392064]
Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving. We propose ScaleDiff, a pipeline designed to scale the creation of difficult problems. We show that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models.
arXiv Detail & Related papers (2025-09-25T12:22:44Z)
MAB Optimizer for Estimating Math Question Difficulty via Inverse CV without NLP [3.9566483499208633]
This study introduces the Approach of Passive Measures among Educands (APME), a reinforcement learning-based Multi-Armed Bandit (MAB) framework. By leveraging the inverse coefficient of variation as a risk-adjusted metric, the model provides an explainable and scalable mechanism for adaptive assessment.
arXiv Detail & Related papers (2025-08-26T13:23:31Z)
Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.532180752491954]
Large Language Models (LLMs) are often described as instances of foundation models that possess strong generalization obeying scaling laws. We demonstrate here a dramatic breakdown of generalization and basic reasoning of all SOTA models claiming strong function. We also observe strong overconfidence in the wrong solutions, expressed in the form of plausible-sounding explanation-like confabulations.
arXiv Detail & Related papers (2024-06-04T07:43:33Z)
Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets [46.19529338280716]
Language models, characterized by their black-box nature, often hallucinate and display sensitivity to input perturbations. We introduce a methodology designed to examine how input perturbations affect language models across various scales. We present three distinct fine-tuning strategies to address robustness against multiple perturbations.
arXiv Detail & Related papers (2023-11-15T02:59:10Z)
A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models [81.15974174627785]
We study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space. Our analysis shows that robustness does not appear to continuously improve as a function of size, but the GPT-3 Davinci models (175B) achieve a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.
arXiv Detail & Related papers (2022-10-21T15:12:37Z)
Learning perturbation sets for robust machine learning [97.6757418136662]
We use a conditional generator that defines the perturbation set over a constrained region of the latent space. We measure the quality of our learned perturbation sets both quantitatively and qualitatively. We leverage our learned perturbation sets to train models which are empirically and certifiably robust to adversarial image corruptions and adversarial lighting variations.
arXiv Detail & Related papers (2020-07-16T16:39:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.