Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs
- URL: http://arxiv.org/abs/2603.03538v1
- Date: Tue, 03 Mar 2026 21:50:14 GMT
- Title: Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs
- Authors: Maria-Florina Balcan, Avrim Blum, Kiriaki Fragkia, Zhiyuan Li, Dravyansh Sharma,
- Abstract summary: We propose an online learning framework for learning chain-of-thought verifiers. We show how our learned verifiers can be used to boost the accuracy of a collection of weak provers.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models with chain-of-thought generation have demonstrated great potential for producing complex mathematical proofs. However, their reasoning can often go astray, leading to increasing interest in formal and learned verifiers. A major challenge in learning verifiers, especially when their output will be used by the prover, is that this feedback loop may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric roles of soundness mistakes (failing to catch errors in a proof) and completeness mistakes (flagging correct proofs as wrong), we introduce novel extensions of the Littlestone dimension that tightly characterize the mistake bounds for learning a verifier in the realizable setting. We provide optimal algorithms for finding the Pareto frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as for minimizing a linear combination of asymmetric costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak provers, and to enable generation of proofs beyond what they were trained on. Under the mild assumption that one of the provers can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong prover with small error and abstention rates.
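To make the soundness/completeness asymmetry concrete, here is a minimal illustrative sketch (not the paper's algorithm) of online verifier learning over a finite hypothesis class in the realizable setting. The names `online_verifier`, `valid_threshold`, and the threshold-vote prediction rule are assumptions for illustration: predicting "valid" only when most surviving hypotheses agree biases the learner toward completeness mistakes (rejecting correct proofs) and away from soundness mistakes (accepting flawed ones).

```python
# Illustrative sketch, not the paper's algorithm: online learning of a
# verifier over a finite hypothesis class, realizable setting.
def online_verifier(hypotheses, stream, valid_threshold=0.9):
    """hypotheses: callables mapping a proof to True (valid) / False.
    stream: iterable of (proof, label) pairs; some hypothesis is perfect.
    Returns (soundness_mistakes, completeness_mistakes)."""
    version_space = list(hypotheses)
    soundness = completeness = 0
    for proof, label in stream:
        votes = [h(proof) for h in version_space]
        frac_valid = sum(votes) / len(votes)
        # Cautious rule: only call a proof "valid" on a supermajority vote.
        prediction = frac_valid >= valid_threshold
        if prediction and not label:
            soundness += 1        # accepted a flawed proof
        elif not prediction and label:
            completeness += 1     # rejected a correct proof
        # Realizable setting: discard hypotheses inconsistent with the label.
        version_space = [h for h, v in zip(version_space, votes) if v == label]
    return soundness, completeness
```

Lowering `valid_threshold` trades in the other direction: fewer completeness mistakes at the cost of possibly more soundness mistakes, which is the kind of trade-off the paper's Pareto-frontier algorithms make precise.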
Related papers
- Proof-RM: A Scalable and Generalizable Reward Model for Math Proof [67.53066972145183]
Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR). Many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required.
arXiv Detail & Related papers (2026-02-02T17:42:53Z)
- Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection [42.21636315733425]
Large language models have achieved remarkable success on final-answer mathematical problems. However, the reasoning underlying these solutions is often flawed. We evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance.
arXiv Detail & Related papers (2025-11-17T06:25:35Z)
- Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking [54.43083499412643]
Test-time algorithms that combine the generative power of language models with process verifiers offer a promising lever for eliciting new reasoning capabilities. We introduce a new process-guided test-time sampling algorithm, VGB, which uses theoretically grounded backtracking to achieve provably better robustness to verifier errors.
arXiv Detail & Related papers (2025-10-03T16:21:14Z)
- Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation [82.2288581878096]
We present a framework for difficulty-aware reasoning that teaches models to dynamically adjust reasoning depth based on problem complexity. We show that models can be endowed with such dynamic inference pathways without any architectural modifications.
arXiv Detail & Related papers (2025-09-05T16:40:13Z)
- Think When You Need: Self-Adaptive Chain-of-Thought Learning [20.22448368125018]
Chain of Thought (CoT) reasoning enhances language models' performance but often leads to inefficient "overthinking" on simple problems. We identify that existing approaches directly penalizing reasoning length fail to account for varying problem complexity. Our approach constructs rewards through length and quality comparisons, guided by theoretical assumptions that jointly enhance solution correctness with conciseness.
arXiv Detail & Related papers (2025-04-04T07:34:01Z)
- On the Query Complexity of Verifier-Assisted Language Generation [35.43462431990329]
We develop a framework for reasoning about constrained generation using a pre-trained language model generator oracle. Access to a verifier can render an intractable problem (information-theoretically or computationally) tractable. We show that even simple algorithms, like tokenwise rejection sampling, can enjoy significant benefits from access to a verifier.
arXiv Detail & Related papers (2025-02-17T18:46:32Z)
- Distribution Learning with Valid Outputs Beyond the Worst-Case [25.788559173418363]
Validity-constrained distribution learning attempts to address this problem by requiring that the learned distribution have a provably small fraction of its mass in invalid parts of space.
We show that when the data distribution lies in the model class and the log-loss is minimized, the number of samples required to ensure validity has a weak dependence on the validity requirement.
arXiv Detail & Related papers (2024-10-21T17:56:09Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Prover-Verifier Games improve legibility of LLM outputs [12.532113917099885]
We study legibility in the context of solving grade-school math problems.
We propose a training algorithm inspired by the Prover-Verifier Game of Anil et al.
We show that legibility training transfers to time-constrained humans tasked with verifying solution correctness.
arXiv Detail & Related papers (2024-07-18T16:58:18Z)
- LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z)
- Understanding and Mitigating Classification Errors Through Interpretable Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z)
- GRACE: Discriminator-Guided Chain-of-Thought Reasoning [75.35436025709049]
We propose Guiding chain-of-thought ReAsoning with a CorrectnEss Discriminator (GRACE) to steer the decoding process towards producing correct reasoning steps.
GRACE employs a discriminator trained with a contrastive loss over correct and incorrect steps, which is used during decoding to score next-step candidates.
arXiv Detail & Related papers (2023-05-24T09:16:51Z)
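The discriminator-guided decoding idea in the GRACE entry above can be sketched as follows. This is a toy illustration, not the paper's models: `propose` and `score` are hypothetical stand-ins for a step generator and a trained discriminator, and at each reasoning step the highest-scoring candidate is kept greedily.

```python
# Toy sketch of discriminator-guided step selection (stand-ins, not GRACE's
# actual generator/discriminator): score sampled candidate steps and keep
# the best one at each position.
def guided_decode(propose, score, prefix, n_steps, n_candidates=4):
    """propose(steps) -> candidate next step (str);
    score(steps, candidate) -> float, higher = judged more likely correct."""
    steps = list(prefix)
    for _ in range(n_steps):
        candidates = [propose(steps) for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: score(steps, c))
        steps.append(best)
    return steps
```

Greedy selection is the simplest choice here; a beam over discriminator scores would be a natural extension, at the cost of more generator calls per step.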
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.