interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors
- URL: http://arxiv.org/abs/2602.11202v1
- Date: Thu, 05 Feb 2026 08:35:01 GMT
- Title: interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors
- Authors: Vishak K Bhat, Prateek Chanda, Ashmit Khandelwal, Maitreyi Swaroop, Vineeth N. Balasubramanian, Subbarao Kambhampati, Nagarajan Natarajan, Amit Sharma
- Abstract summary: We present a test-time verification framework, interwhen, that ensures that the output of a reasoning model is valid with respect to a given set of verifiers. Verified reasoning is an important goal in high-stakes scenarios such as deploying agents in the physical world.
- Score: 47.363850513075356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a test-time verification framework, interwhen, that ensures that the output of a reasoning model is valid with respect to a given set of verifiers. Verified reasoning is an important goal in high-stakes scenarios such as deploying agents in the physical world or in domains such as law and finance. However, current techniques either rely on the generate-test paradigm, which verifies only after the final answer is produced, or verify partial output through a step-extraction paradigm, where the task execution is externally broken down into structured steps. The former is inefficient, while the latter artificially restricts a model's problem-solving strategies. Instead, we propose to verify a model's reasoning trace as-is, taking full advantage of the model's reasoning capabilities while verifying and steering its output only when needed. The key idea is meta-prompting: identifying the verifiable properties that any partial solution should satisfy, then prompting the model to follow a custom format in its trace so that partial outputs can be easily parsed and checked. We consider both self-verification and external verification and find that interwhen provides a useful abstraction for giving feedback and steering reasoning models in each case. Using self-verification, interwhen obtains state-of-the-art results on early stopping of reasoning models, without any loss in accuracy. Using external verifiers, interwhen obtains a 10 percentage-point improvement in accuracy over test-time scaling methods, while ensuring 100% soundness and being 4x more efficient. The code for interwhen is available at https://github.com/microsoft/interwhen
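The verify-and-steer loop can be made concrete with a short sketch. The monitor below meta-prompts the model to tag partial conclusions, parses each tagged span, runs it through the verifiers, and feeds failures back as steering hints. The <step> tag format, function names, and re-prompting strategy are illustrative assumptions, not interwhen's actual API:
```python
import re
from typing import Callable

# A verifier maps a parsed partial step to (ok, feedback).
Verifier = Callable[[str], tuple[bool, str]]

def arithmetic_verifier(step: str) -> tuple[bool, str]:
    """Toy external verifier: checks every 'a + b = c' claim in a step."""
    for a, b, c in re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", step):
        if int(a) + int(b) != int(c):
            return False, f"{a} + {b} != {c}; recompute this sum."
    return True, ""

def monitor_trace(generate: Callable[[str], str], prompt: str,
                  verifiers: list[Verifier], max_rounds: int = 3) -> str:
    """Verify a reasoning trace as-is, steering only when a check fails."""
    # Meta-prompt: ask for a parseable format without dictating the strategy.
    meta_prompt = (prompt + "\nWrap each intermediate conclusion in "
                   "<step>...</step> tags so it can be checked.")
    trace = ""
    for _ in range(max_rounds):
        trace = generate(meta_prompt)
        feedback = []
        for step in re.findall(r"<step>(.*?)</step>", trace, re.S):
            for verify in verifiers:
                ok, msg = verify(step)
                if not ok:
                    feedback.append(msg)
        if not feedback:
            return trace  # every parsed partial output passed its checks
        # Steer: append verifier feedback rather than regenerating blindly.
        meta_prompt += "\nA verifier flagged: " + " ".join(feedback)
    return trace  # budget exhausted; return the last trace
```
For example, monitor_trace(llm, "What is 17 + 25?", [arithmetic_verifier]) re-prompts the model whenever it writes an incorrect sum inside a <step> tag, instead of waiting for the final answer.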
Related papers
- Towards Anytime-Valid Statistical Watermarking [63.02116925616554]
We develop the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines (see the sketch below).
arXiv Detail & Related papers (2026-02-19T18:32:26Z)
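A minimal sketch of the anytime-valid e-value idea mentioned above: under the null (no watermark), each e-value has expectation at most 1, so the running product is a supermartingale and Ville's inequality caps the false-alarm rate at alpha at any stopping time. This is the generic e-process recipe, not the paper's Anchored E-Watermarking statistic, and it assumes per-token e-values are already computed:
```python
def anytime_valid_detect(e_values, alpha=0.05):
    """Generic e-process test: multiply e-values and stop when the running
    product ('wealth') crosses 1/alpha. Valid at any data-dependent stopping
    time, which is what lets detection spend fewer tokens on easy cases."""
    wealth = 1.0
    for t, e in enumerate(e_values, start=1):
        wealth *= e
        if wealth >= 1.0 / alpha:
            return t  # watermark detected after t tokens
    return None  # no detection within the observed tokens
```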
- Preventing the Collapse of Peer Review Requires Verification-First AI [49.995126139461085]
We propose truth-coupling, i.e., a measure of how tightly venue scores track latent scientific truth. We formalize two forces that drive a phase transition toward proxy-sovereign evaluation.
arXiv Detail & Related papers (2026-01-23T17:17:32Z)
- Are Large Reasoning Models Interruptible? [77.53059044071107]
Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings. We show that even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context. Our analysis further reveals several novel failure modes, including reasoning leakage, panic, and self-doubt.
arXiv Detail & Related papers (2025-10-13T17:59:35Z)
- Trust but Verify! A Survey on Verification Design for Test-time Scaling [8.428618801719198]
Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. Verifiers serve as reward models that score the candidate outputs from the decoding process. Verifiers can be prompt-based, or fine-tuned as discriminative or generative models (see the sketch below).
arXiv Detail & Related papers (2025-08-20T22:27:21Z)
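As a concrete instance of the verifier designs such a survey covers, here are two standard verifier-guided test-time scaling recipes, best-of-n selection and verifier-weighted voting; generate, verifier_score, and extract_answer are placeholder callables, not an API from the paper:
```python
from collections import defaultdict

def best_of_n(generate, verifier_score, prompt, n=8):
    """Sample n candidates and keep the one the verifier (a reward model
    over full candidate outputs) scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)

def verifier_weighted_vote(candidates, verifier_score, extract_answer):
    """Variant: verifier-weighted self-consistency. Sum scores per distinct
    final answer and return the answer with the largest total."""
    totals = defaultdict(float)
    for c in candidates:
        totals[extract_answer(c)] += verifier_score(c)
    return max(totals, key=totals.get)
```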
- Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute [60.151643048803145]
We propose Fractional Reasoning, a framework that enables continuous control over reasoning intensity at inference time. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor (see the sketch below). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
arXiv Detail & Related papers (2025-06-18T21:15:59Z)
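A minimal sketch of the steering mechanism: a latent direction associated with deeper reasoning is added to hidden states with a tunable scale alpha. The difference-of-means extraction below is a common recipe assumed here for illustration, not necessarily the paper's exact procedure:
```python
import torch

def extract_steering_vector(deep_hiddens: torch.Tensor,
                            plain_hiddens: torch.Tensor) -> torch.Tensor:
    """Assumed extraction: mean hidden state under a 'reason deeply' prompt
    minus the mean under a plain prompt (shapes: [n, d] and [m, d])."""
    return deep_hiddens.mean(dim=0) - plain_hiddens.mean(dim=0)

def apply_fractional_steering(hidden: torch.Tensor, steer_vec: torch.Tensor,
                              alpha: float) -> torch.Tensor:
    """Reapply the latent direction with a continuous scaling factor alpha:
    alpha=0 leaves the model unchanged, larger alpha pushes deeper reasoning."""
    return hidden + alpha * steer_vec
```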
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens [14.78605805191225]
We investigate how the semantics of intermediate tokens, often anthropomorphized as "thoughts" or reasoning traces, actually influence model performance. We show that, despite significant improvements over the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions.
arXiv Detail & Related papers (2025-05-19T23:29:23Z)
- VerifiAgent: a Unified Verification Agent in Language Model Reasoning [10.227089771963943]
We propose a unified verification agent that integrates two levels of verification: meta-verification and tool-based adaptive verification. VerifiAgent autonomously selects appropriate verification tools based on the reasoning type (see the sketch below). It can be effectively applied to inference scaling, achieving better results with fewer generated samples and lower cost.
arXiv Detail & Related papers (2025-04-01T04:05:03Z)
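The adaptive-tool idea reduces to a dispatch from the detected reasoning type to a checker. The registry below is a hypothetical illustration, not VerifiAgent's actual toolset:
```python
def adaptive_verify(reasoning_type: str, answer: str, tools: dict) -> bool:
    """Dispatch to the verification tool registered for the detected
    reasoning type, falling back to a generic check."""
    checker = tools.get(reasoning_type, tools["generic"])
    return checker(answer)

# Hypothetical registry; real tools might call a code executor or a solver.
TOOLS = {
    "arithmetic": lambda a: a.strip().lstrip("-").isdigit(),
    "generic": lambda a: bool(a.strip()),
}

assert adaptive_verify("arithmetic", "42", TOOLS)
```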
- AutoPSV: Automated Process-Supervised Verifier [10.283965168399158]
We introduce the Automated Process-Supervised Verifier (AutoPSV). AutoPSV begins by training a verification model on the correctness of final answers. We experimentally validate that the step-level confidence changes learned by this verification model can effectively identify errors in the reasoning steps (see the sketch below).
arXiv Detail & Related papers (2024-05-27T03:44:24Z)
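A sketch of the step-level signal described above, assuming the verifier emits a confidence score after each reasoning step; the drop threshold and example scores are illustrative, not values from the paper:
```python
def flag_error_steps(step_confidences, drop_threshold=0.2):
    """Flag step i as likely erroneous when the verifier's confidence in the
    partial solution drops sharply from step i-1 to step i."""
    return [i for i in range(1, len(step_confidences))
            if step_confidences[i - 1] - step_confidences[i] > drop_threshold]

# e.g. flag_error_steps([0.90, 0.88, 0.55, 0.50]) -> [2]
```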
- Certifiable 3D Object Pose Estimation: Foundations, Learning Models, and Self-Training [23.802602957611676]
We consider a certifiable object pose estimation problem, where, given a partial point cloud of an object, the goal is to provide a certificate of correctness for the resulting estimate. We propose C-3PO, a semantic-keypoint-based pose estimation model, augmented with two certificates.
arXiv Detail & Related papers (2022-06-22T17:06:39Z)
- Tracking the risk of a deployed model and detecting harmful distribution shifts [105.27463615756733]
In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially.
We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate (see the sketch below).
arXiv Detail & Related papers (2021-10-12T17:21:41Z)
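One simple way to meet requirement (b), continuous monitoring without inflating the false alarm rate, is a time-uniform confidence bound. The Hoeffding-plus-union-bound construction below is a crude stand-in for the paper's sequential tests, and it assumes losses bounded in [0, 1]:
```python
import math

def risk_monitor(losses, source_risk, tolerance=0.05, alpha=0.05):
    """Warn only when the running risk exceeds source_risk + tolerance by
    more than a time-uniform Hoeffding radius. The union bound over
    alpha_t = alpha / (t * (t + 1)) keeps the false-alarm rate <= alpha
    over the entire (unbounded) monitoring horizon."""
    total = 0.0
    for t, loss in enumerate(losses, start=1):  # losses assumed in [0, 1]
        total += loss
        radius = math.sqrt(math.log(t * (t + 1) / alpha) / (2 * t))
        if total / t - radius > source_risk + tolerance:
            return t  # harmful shift detected at time t
    return None  # no warning: any shift so far looks benign
```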
- Auditing AI models for Verified Deployment under Semantic Specifications [65.12401653917838]
AuditAI bridges the gap between interpretable formal verification and scalability.
We show how AuditAI allows us to obtain controlled variations for verification and certified training while addressing the limitations of verifying using only pixel-space perturbations.
arXiv Detail & Related papers (2021-09-25T22:53:24Z)
- Automated Repair of Process Models with Non-Local Constraints Using State-Based Region Theory [0.19499120576896226]
State-of-the-art process discovery methods construct free-choice process models from event logs.
We propose a novel approach for enhancing free-choice process models by adding non-free-choice constructs discovered a posteriori via region-based techniques.
arXiv Detail & Related papers (2021-06-26T21:14:04Z)