Related papers: HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

URL: http://arxiv.org/abs/2511.18760v1
Date: Mon, 24 Nov 2025 04:50:18 GMT
Title: HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs
Authors: Azim Ospanov, Zijin Feng, Jiacheng Sun, Haoli Bai, Xin Shen, Farzan Farnia,
Abstract summary: Hermes is a tool-assisted agent that interleaves informal reasoning with verified proof steps in Lean systems.<n>We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales.
Score: 32.234133057592935
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Informal mathematics has been central to modern large language model (LLM) reasoning, offering flexibility and enabling efficient construction of arguments. However, purely informal reasoning is prone to logical gaps and subtle errors that are difficult to detect and correct. In contrast, formal theorem proving provides rigorous, verifiable mathematical reasoning, where each inference step is checked by a trusted compiler in systems such as Lean, but lacks the exploratory freedom of informal problem solving. This mismatch leaves current LLM-based math agents without a principled way to combine the strengths of both paradigms. In this work, we introduce Hermes, the first tool-assisted agent that explicitly interleaves informal reasoning with formally verified proof steps in Lean. The framework performs intermediate formal checking to prevent reasoning drift and employs a memory module that maintains proof continuity across long, multi-step reasoning chains, enabling both exploration and verification within a single workflow. We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales, from small models to state-of-the-art systems. Across all settings, Hermes reliably improves the reasoning accuracy of base models while substantially reducing token usage and computational cost compared to reward-based approaches. On difficult datasets such as AIME'25, Hermes achieves up to a 67% accuracy improvement while using 80% fewer total inference FLOPs. The implementation and codebase are publicly available at https://github.com/aziksh-ospanov/HERMES.

Related papers

Training LLMs with LogicReward for Faithful and Rigorous Reasoning [75.30425553246177]
We propose LogicReward, a reward system that guides model training by enforcing step-level logical correctness with a theorem prover.<n>An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks.
arXiv Detail & Related papers (2025-12-20T03:43:02Z)
In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback [38.915062716409686]
InTRO is a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning.<n>InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model.<n>Its chains of thought are notably more concise, exhibiting reduced verbosity.
arXiv Detail & Related papers (2025-11-13T01:47:06Z)
Are Language Models Efficient Reasoners? A Perspective from Logic Programming [109.47572890883248]
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency.<n>We propose a framework for assessing LM reasoning efficiency through the lens of logic programming.
arXiv Detail & Related papers (2025-10-29T15:30:31Z)
DAG-Math: Graph-Guided Mathematical Reasoning in LLMs [54.231935013127206]
Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT)<n>We propose modeling CoT as a certain rule-based process over directed acyclic graphs (DAGs)<n>We introduce logical closeness, a metric that quantifies how well a model's CoT trajectory adheres to the DAG structure.
arXiv Detail & Related papers (2025-10-19T21:05:17Z)
Solving Inequality Proofs with Large Language Models [42.667163027148916]
Inequality proving is crucial across diverse scientific and mathematical fields.<n>This makes it a demanding frontier for large language models (LLMs)<n>We release IneqMath, an expert-curated dataset of Olympiad-level inequalities.
arXiv Detail & Related papers (2025-06-09T16:43:38Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
Automated Theorem Provers Help Improve Large Language Model Reasoning [0.18416014644193066]
We show how accuracy can be improved with a neuro-symbolic architecture. We define a framework of syntactic and semantic error categories. We extend our method with capabilities for automatically correcting syntactic and semantic errors.
arXiv Detail & Related papers (2024-08-07T01:03:56Z)
Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization [45.439933713342256]
Large language models (LLM) are becoming increasingly capable of solving mathematical quantitative reasoning problems. We leverage the fact that if the training corpus of LLMs contained sufficiently many examples of formal mathematics, they can be prompted to translate into formal Isabelle code. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement.
arXiv Detail & Related papers (2024-03-26T22:01:13Z)
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity. We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
Neuro-Symbolic Integration Brings Causal and Reliable Reasoning Proofs [95.07757789781213]
Two lines of approaches are adopted for complex reasoning with LLMs.<n>One line of work prompts LLMs with various reasoning structures, while the structural outputs can be naturally regarded as intermediate reasoning steps.<n>The other line of work adopt LLM-free declarative solvers to do the reasoning task, rendering higher reasoning accuracy but lacking interpretability due to the black-box nature of the solvers.<n>We present a simple extension to the latter line of work. Specifically, we showcase that the intermediate search logs generated by Prolog interpreters can be accessed and interpreted into human-readable reasoning.
arXiv Detail & Related papers (2023-11-16T11:26:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.