Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning
- URL: http://arxiv.org/abs/2602.00998v1
- Date: Sun, 01 Feb 2026 03:34:30 GMT
- Title: Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning
- Authors: Zhikun Xu, Xiaodong Yu, Ben Zhou, Jiang Liu, Jialian Wu, Ze Wang, Ximeng Sun, Hao Chen, Zicheng Liu,
- Abstract summary: Recent large language models often misapply lemmas, importing conclusions without validating assumptions. We present RULES, which encodes the lemma-judging specification via a two-section output and trains with reinforcement learning. Training and evaluation draw on diverse natural language and formal proof corpora.
- Score: 27.01879432423409
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating assumptions. We formalize lemma-judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion-utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification via a two-section output and trains with reinforcement learning plus section-aware loss masking to assign penalty to the section responsible for errors. Training and evaluation draw on diverse natural language and formal proof corpora; robustness is assessed with a held-out perturbation suite; and end-to-end evaluation spans competition-style, perturbation-aligned, and theorem-based problems across various LLMs. Results show consistent in-domain gains over both a vanilla model and a single-label RL baseline, larger improvements on applicability-breaking perturbations, and parity or modest gains on end-to-end tasks; ablations indicate that the two-section outputs and section-aware reinforcement are both necessary for robustness.
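The two-section judgment described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the class and function names are hypothetical, the rule that a lemma is useful only when both checks pass is an assumed reading of "from which a usefulness decision is derived", and the per-section disagreement mask is only one plausible way to picture assigning penalty to the responsible section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LemmaJudgment:
    """Hypothetical two-section verdict for one (statement, lemma) pair."""
    preconditions_hold: bool  # section 1: are the lemma's assumptions met?
    conclusion_helps: bool    # section 2: does its conclusion serve the goal?

    @property
    def useful(self) -> bool:
        # Assumed derivation rule: useful only when both checks pass.
        return self.preconditions_hold and self.conclusion_helps


def sections_to_penalize(pred: LemmaJudgment, gold: LemmaJudgment) -> dict[str, bool]:
    """Hypothetical mask: True marks a section that disagrees with the reference."""
    return {
        "precondition": pred.preconditions_hold != gold.preconditions_hold,
        "conclusion": pred.conclusion_helps != gold.conclusion_helps,
    }


# A lemma whose conclusion would help, but whose assumptions do not hold.
gold = LemmaJudgment(preconditions_hold=False, conclusion_helps=True)
# A prediction that imports the conclusion without validating the assumptions.
pred = LemmaJudgment(preconditions_hold=True, conclusion_helps=True)

print(pred.useful, gold.useful)
print(sections_to_penalize(pred, gold))
```

In this sketch a single usefulness label would only reveal that the prediction is wrong, whereas the two-section form localizes the error to the precondition check.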
Related papers
- ConvexBench: Can LLMs Recognize Convex Functions? [70.53167848190624]
Convex analysis is a modern branch of mathematics with many applications. As Large Language Models (LLMs) start to automate research-level math and sciences, it is important for LLMs to demonstrate the ability to understand and reason with convexity. We introduce ConvexBench, a scalable and mechanically verifiable benchmark for testing whether LLMs can identify the convexity of a symbolic objective under deep functional composition.
arXiv Detail & Related papers (2026-02-01T07:41:17Z)
- Fundamental Novel Consistency Theory: $H$-Consistency Bounds [19.493449206135296]
In machine learning, the loss functions optimized during training often differ from the target loss that defines task performance. We present an in-depth study of the target loss estimation error relative to the surrogate loss estimation error. Our analysis leads to $H$-consistency bounds, which are guarantees accounting for the hypothesis set $H$.
arXiv Detail & Related papers (2025-12-28T11:02:20Z)
- Hard Negative Sample-Augmented DPO Post-Training for Small Language Models [4.425580048633862]
We propose a lightweight and pragmatic post-training pipeline that targets structured errors under realistic compute budgets. We introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. Experiments show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO.
arXiv Detail & Related papers (2025-12-17T06:15:52Z)
- Mastering Multiple-Expert Routing: Realizable $H$-Consistency and Strong Guarantees for Learning to Defer [30.389055604165222]
This paper introduces novel surrogate loss functions and efficient algorithms with strong theoretical learning guarantees. We address open questions regarding realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for both single-stage and two-stage learning scenarios. We derive new surrogate losses that achieve realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for the two-expert scenario and, under natural assumptions, the multiple-expert scenario.
arXiv Detail & Related papers (2025-06-25T17:48:58Z)
- Understanding Bias Reinforcement in LLM Agents Debate [41.35531041809235]
Large Language Models (LLMs) solve complex problems using training-free methods like prompt engineering and in-context learning. Self-correction methods such as self-consistency and self-refinement aim to improve reliability. We identify two key limitations: bias reinforcement and lack of perspective diversity.
arXiv Detail & Related papers (2025-03-21T02:51:30Z)
- Autoformulation of Mathematical Optimization Models Using LLMs [50.030647274271516]
This paper approaches the problem of autoformulation: the automated creation of solver-ready optimization models from natural language problem descriptions. We identify three core challenges of autoformulation, including (1) the vast, problem-dependent hypothesis space and (2) efficient and diverse exploration of this space under uncertainty. We present a novel method leveraging Large Language Models with Monte-Carlo Tree Search, exploiting the hierarchical nature of optimization modeling to generate and systematically explore possible formulations.
arXiv Detail & Related papers (2024-11-03T20:41:38Z)
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning [54.585428241509234]
We propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL).
R$^3$ employs only outcome supervision to achieve the benefits of process supervision for large language models.
arXiv Detail & Related papers (2024-02-08T16:46:26Z)
- OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning [15.59540726867483]
We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness.
Inspired by the findings that outcome supervision for guided decoding essentially acts as a value model, we propose the Outcome-supervised Value Model (OVM).
Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model.
arXiv Detail & Related papers (2023-11-16T09:56:28Z)
- How Shift Equivariance Impacts Metric Learning for Instance Segmentation [11.981698445848748]
We show that a standard encoder-decoder network has the capacity to distinguish at most $f^{dl}$ same-looking objects.
We also show that to avoid discontinuities in a tile-and-stitch approach, it is necessary to employ valid convolutions in combination with a training output window size strictly greater than $f^l$.
arXiv Detail & Related papers (2021-01-14T19:48:24Z)
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)
- Distributional Robustness and Regularization in Reinforcement Learning [62.23012916708608]
We introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function.
It suggests using regularization as a practical tool for dealing with external uncertainty in reinforcement learning.
arXiv Detail & Related papers (2020-03-05T19:56:23Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.