VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning
- URL: http://arxiv.org/abs/2601.20055v1
- Date: Tue, 27 Jan 2026 20:59:11 GMT
- Title: VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning
- Authors: Vikash Singh, Darion Cassel, Nathaniel Weir, Nick Feng, Sam Bayless,
- Abstract summary: We present a neurosymbolic framework that combines Large Language Models with SMT solvers to produce verification-guided answers.<n>We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking, (2) semantic routing that directs different claim types to appropriate verification strategies, and (3) precise logical error localization via Minimal Correction Subsets.<n>With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
- Score: 4.3414302048068745
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers to produce verification-guided answers through iterative refinement. Our approach decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and verifies their logical consistency using automated theorem proving. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking to ensure logic-level alignment between candidates, eliminating the syntactic bias of surface-form metrics, (2) semantic routing that directs different claim types to appropriate verification strategies: symbolic solvers for logical claims and LLM ensembles for commonsense reasoning, and (3) precise logical error localization via Minimal Correction Subsets (MCS), which pinpoint the exact subset of claims to revise, transforming binary failure signals into actionable feedback. Our framework classifies claims by their logical status and aggregates multiple verification signals into a unified score with variance-based penalty. The system iteratively refines answers using structured feedback until acceptance criteria are met or convergence is achieved. This hybrid approach delivers formal guarantees where possible and consensus verification elsewhere, advancing trustworthy AI. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
Related papers
- DiffuRank: Effective Document Reranking with Diffusion Language Models [71.16830004674513]
We propose DiffuRank, a reranking framework built upon diffusion language models (dLLMs)<n>dLLMs support more flexible decoding and generation processes that are not constrained to a left-to-right order.<n>We show dLLMs achieve performance comparable to, and in some cases exceeding, that of autoregressive LLMs with similar model sizes.
arXiv Detail & Related papers (2026-02-13T02:18:14Z) - Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification [49.506412445511934]
Large Language Models (LLMs) show remarkable capabilities, yet their next-token prediction creates logical inconsistencies and reward hacking.<n>We introduce a formal logic verification-guided framework that dynamically interleaves formal symbolic verification with the natural language generation process.<n>We operationalize this framework via a novel two-stage training pipeline that synergizes formal logic verification-guided supervised fine-tuning and policy optimization.
arXiv Detail & Related papers (2026-01-30T07:01:25Z) - Memorization, Emergence, and Explaining Reversal Failures: A Controlled Study of Relational Semantics in LLMs [43.414287127130684]
We propose a synthetic framework that generates text from symmetric/inverse triples, trains GPT-style autoregressive models from scratch, and evaluate memorization, logical inference, and in-context generalization.<n>We find that relational semantics emerge with sufficient logic-bearing supervision, even in shallow (2-3 layer) models, and that successful generalization aligns with stable intermediate-layer signals.
arXiv Detail & Related papers (2026-01-06T11:20:38Z) - Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems.<n>We present FactArena, a fully automated arena-style evaluation framework.<n>Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z) - KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering [64.62317305868264]
We present textbfKBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning.<n>Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions.<n>Experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-10T17:45:42Z) - Formal that "Floats" High: Formal Verification of Floating Point Arithmetic [0.0]
This paper presents a scalable methodology for verifying floating-point arithmetic using direct RTL-to-RTL model checking against a golden reference model.<n>The methodology is extended with agentic AI-based formal property generation, integrating large language model (LLM)-driven automation with Human-in-the-Loop (HITL) refinement.<n>Results show that direct RTL-to-RTL model checking achieves higher coverage efficiency and requires fewer assertions than standalone verification.
arXiv Detail & Related papers (2025-12-07T14:03:44Z) - Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression [3.3492355863487275]
Large language models (LLMs) achieve strong performance on many natural language tasks, yet their generalisation under structured perturbations of logical rule systems remains insufficiently characterised.<n>We present a controlled evaluation framework that probes reasoning reliability through four stress tests.
arXiv Detail & Related papers (2025-12-06T10:49:50Z) - seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs [1.0519693622157462]
We introduce seqBench, a benchmark for probing sequential reasoning limits in Large Language Models (LLMs)<n>We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity.
arXiv Detail & Related papers (2025-09-21T01:32:13Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Explainable Rule Application via Structured Prompting: A Neural-Symbolic Approach [0.0]
Large Language Models (LLMs) excel in complex reasoning tasks but struggle with consistent rule application, exception handling, and explainability.<n>This paper introduces a structured prompting framework that decomposes reasoning into three verifiable steps: entity identification, property extraction, and symbolic rule application.
arXiv Detail & Related papers (2025-06-19T14:14:01Z) - CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection [60.98964268961243]
We propose that guiding models to perform a systematic and comprehensive reasoning process allows models to execute much finer-grained and accurate entailment decisions.<n>We define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection.
arXiv Detail & Related papers (2025-06-05T17:02:52Z) - Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA)<n>We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z) - Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning [74.90592233107712]
We propose a Direct-Indirect Reasoning (DIR) method, which considers Direct Reasoning (DR) and Indirect Reasoning (IR) as multiple parallel reasoning paths that are merged to derive the final answer.<n>Our DIR method is simple yet effective and can be straightforwardly integrated with existing variants of CoT methods.
arXiv Detail & Related papers (2024-02-06T03:41:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.