Related papers: Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset

URL: http://arxiv.org/abs/2506.20729v1
Date: Wed, 25 Jun 2025 18:00:18 GMT
Title: Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset
Authors: Zhiqi Gao, Tianyi Li, Yurii Kvasiuk, Sai Chaitanya Tadepalli, Maja Rudolph, Daniel J. H. Chung, Frederic Sala, Moritz Münchmeyer
Abstract summary: We evaluate a range of common test-time scaling methods on the TPBench physics dataset. We develop a novel, symbolic weak-verifier framework to improve parallel scaling results. Our findings highlight the power of step-wise symbolic verification for tackling complex scientific problems.
Score: 13.530403536762064
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have shown strong capabilities in complex reasoning, and test-time scaling techniques can enhance their performance with comparably low cost. Many of these methods have been developed and evaluated on mathematical reasoning benchmarks such as AIME. This paper investigates whether the lessons learned from these benchmarks generalize to the domain of advanced theoretical physics. We evaluate a range of common test-time scaling methods on the TPBench physics dataset and compare their effectiveness with results on AIME. To better leverage the structure of physics problems, we develop a novel, symbolic weak-verifier framework to improve parallel scaling results. Our empirical results demonstrate that this method significantly outperforms existing test-time scaling approaches on TPBench. We also evaluate our method on AIME, confirming its effectiveness in solving advanced mathematical problems. Our findings highlight the power of step-wise symbolic verification for tackling complex scientific problems.

Related papers

ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
ARISE (Adaptive Resolution-aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains.
arXiv Detail & Related papers (2025-10-07T15:10:51Z)
PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning [57.868248683256574]
PRISM-Physics is a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas. Results show that our evaluation framework is aligned with human experts' scoring.
arXiv Detail & Related papers (2025-10-03T17:09:03Z)
An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems [48.10132234701036]
We introduce a systematic framework to assess LLMs' mathematical-reasoning robustness. We stress-test them on advanced math problems that are mathematically equivalent but with linguistic and parametric variation. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset.
arXiv Detail & Related papers (2025-08-12T10:40:33Z)
Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning [81.50681925980135]
We propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results.
arXiv Detail & Related papers (2025-05-23T12:42:50Z)
Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space [82.75174050101108]
We introduce LatentSeek, a framework that enhances reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024. Results show that LatentSeek consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-19T16:26:02Z)
Iterative Deepening Sampling as Efficient Test-Time Scaling [27.807695570974644]
Recent reasoning models, such as OpenAI's O1 series, have demonstrated exceptional performance on complex reasoning tasks. We propose a novel iterative deepening sampling algorithm framework designed to enhance self-correction and generate higher-quality samples.
arXiv Detail & Related papers (2025-02-08T04:39:51Z)
Deep Plug-and-Play HIO Approach for Phase Retrieval [0.0]
In the phase retrieval problem, the aim is the recovery of an unknown image from intensity-only measurements. Recent learning-based approaches have emerged as powerful alternatives to the analytical methods for several inverse problems. A novel plug-and-play approach that exploits learning-based prior and efficient update steps has been presented.
arXiv Detail & Related papers (2024-11-28T07:36:29Z)
See Further for Parameter Efficient Fine-tuning by Standing on the Shoulders of Decomposition [56.87609859444084]
Parameter-efficient fine-tuning (PEFT) focuses on optimizing a select subset of parameters while keeping the rest fixed, significantly lowering computational and storage overheads. We take the first step to unify all approaches by dissecting them from a decomposition perspective. We introduce two novel PEFT methods alongside a simple yet effective framework designed to enhance the performance of PEFT techniques across various applications.
arXiv Detail & Related papers (2024-07-07T15:44:42Z)
Discovering physical laws with parallel combinatorial tree search [57.05912962368898]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data. Existing algorithms have faced a critical bottleneck of accuracy and efficiency for over a decade. We introduce a parallel combinatorial tree search (PCTS) model to efficiently distill generic mathematical expressions from limited data.
arXiv Detail & Related papers (2024-07-05T10:41:15Z)
Dynamical Isometry based Rigorous Fair Neural Architecture Search [2.7850218655824803]
We propose a novel neural architecture search algorithm based on dynamical isometry. We prove that our module selection strategy is rigorously fair by estimating the generalization error of all modules with well-conditioned Jacobian.
arXiv Detail & Related papers (2023-07-05T13:01:21Z)
Theoretical Analysis on the Efficiency of Interleaved Comparisons [3.654658106140114]
This study presents a theoretical analysis on the efficiency of interleaving, an efficient online evaluation method for rankings. We begin by designing a simple interleaving method similar to ordinary interleaving methods. We explore a condition under which the interleaving method is more efficient than A/B testing and find that this is the case when users leave the ranking depending on the item's relevance.
arXiv Detail & Related papers (2023-05-31T03:04:29Z)
Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks. We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z)
Amortized Implicit Differentiation for Stochastic Bilevel Optimization [53.12363770169761]
We study a class of algorithms for solving bilevel optimization problems in both stochastic and deterministic settings. We exploit a warm-start strategy to amortize the estimation of the exact gradient. By using this framework, our analysis shows these algorithms to match the computational complexity of methods that have access to an unbiased estimate of the gradient.
arXiv Detail & Related papers (2021-11-29T15:10:09Z)
Real-Time Model Calibration with Deep Reinforcement Learning [4.707841918805165]
We propose a novel framework for inference of model parameters based on reinforcement learning. The proposed methodology is demonstrated and evaluated on two model-based diagnostics test cases.
arXiv Detail & Related papers (2020-06-07T00:11:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.