VerifiAgent: a Unified Verification Agent in Language Model Reasoning
- URL: http://arxiv.org/abs/2504.00406v1
- Date: Tue, 01 Apr 2025 04:05:03 GMT
- Title: VerifiAgent: a Unified Verification Agent in Language Model Reasoning
- Authors: Jiuzhou Han, Wray Buntine, Ehsan Shareghi
- Abstract summary: We propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification and tool-based adaptive verification. VerifiAgent autonomously selects appropriate verification tools based on the reasoning type. It can be effectively applied to inference scaling, achieving better results with fewer generated samples and lower costs.
- Score: 10.227089771963943
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational resources and lacking scalability across diverse reasoning tasks. To address these limitations, we propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification, which assesses completeness and consistency in model responses, and tool-based adaptive verification, where VerifiAgent autonomously selects appropriate verification tools based on the reasoning type, including mathematical, logical, or commonsense reasoning. This adaptive approach ensures both efficiency and robustness across different verification scenarios. Experimental results show that VerifiAgent outperforms baseline verification methods (e.g., deductive verifier, backward verifier) across all reasoning tasks. Additionally, it can further enhance reasoning accuracy by leveraging feedback from verification results. VerifiAgent can also be effectively applied to inference scaling, achieving better results with fewer generated samples and lower costs compared to existing process reward models in the mathematical reasoning domain. Code is available at https://github.com/Jiuzhouh/VerifiAgent
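The abstract describes a two-level design: a meta-verification pass that checks a response for completeness and consistency, followed by tool-based verification routed by reasoning type. The following is a minimal sketch of that flow, assuming a generic chat-completion helper (`call_llm`) and hypothetical tool checkers (`check_with_python`, `check_with_solver`, `check_with_search`); it is not the authors' released implementation (see the linked GitHub repository for that).

```python
# Minimal sketch of the two-level verification flow described above.
# The tool names, prompts, and `call_llm` helper are illustrative
# assumptions, not the authors' released implementation.
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call (assumption)."""
    raise NotImplementedError("wire this to your LLM provider")

def check_with_python(response: str) -> bool:
    """Hypothetical math check: re-derive the final answer with executable code."""
    ...  # generate + sandbox-execute checking code, compare answers (omitted)
    return True

def check_with_solver(response: str) -> bool:
    """Hypothetical logic check: encode premises and conclusion for a solver."""
    ...
    return True

def check_with_search(response: str) -> bool:
    """Hypothetical commonsense check: verify key facts via retrieval."""
    ...
    return True

TOOLS: Dict[str, Callable[[str], bool]] = {
    "mathematical": check_with_python,
    "logical": check_with_solver,
    "commonsense": check_with_search,
}

def verify(question: str, response: str) -> bool:
    # Level 1: meta-verification of completeness and internal consistency.
    meta = call_llm(
        f"Question: {question}\nResponse: {response}\n"
        "Is the response complete and internally consistent? Answer yes or no."
    )
    if not meta.strip().lower().startswith("yes"):
        return False
    # Level 2: route the response to a verification tool by reasoning type.
    kind = call_llm(
        f"Classify the reasoning in this question as mathematical, logical, "
        f"or commonsense:\n{question}"
    ).strip().lower()
    return TOOLS.get(kind, lambda r: True)(response)
```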
Related papers
- TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts.
Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models. We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities.
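The summary above proposes scoring each of N sampled responses with a model-internal confidence signal instead of an external reward model. The sketch below is one hedged reading of that idea, using mean token log-probability as a stand-in confidence score; the paper's exact self-certainty formula may differ.

```python
# Hedged sketch of Best-of-N selection with a confidence-style score.
# Assumes per-token log-probabilities are available for each sampled
# response; the paper's exact self-certainty metric may differ.
from typing import List, Tuple

def confidence_score(token_logprobs: List[float]) -> float:
    """Mean log-probability of the generated tokens (higher = more certain)."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)

def best_of_n(candidates: List[Tuple[str, List[float]]]) -> str:
    """candidates: (response_text, per-token logprobs). Returns the most confident response."""
    return max(candidates, key=lambda c: confidence_score(c[1]))[0]

# Usage (assumed sampling API): sample N responses with logprobs enabled, then
# answer = best_of_n([(r.text, r.token_logprobs) for r in samples])
```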
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification [35.347715518778095]
We study the scaling trends governing sampling-based search. We find that simply scaling up a minimalist implementation of sampling-based search provides a practical inference method. We identify two useful principles for improving self-verification capabilities with test-time compute.
arXiv Detail & Related papers (2025-02-03T21:31:07Z) - Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning [16.824343439487617]
Large language models (LLMs) struggle with multi-step reasoning, for which inference-time scaling has emerged as a promising strategy for improving performance. Verifier-guided search outperforms repeated sampling when the sample size is limited, by selecting and prioritizing valid reasoning paths. As the sample size increases, however, verifier-guided search exhibits diminishing advantages and eventually underperforms repeated sampling.
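To make the contrast concrete, the toy functions below compare repeated sampling (majority vote over final answers) with verifier-guided selection (pick the candidate the verifier ranks highest). The `verifier_score` callable is an assumed stand-in for any learned verifier, not the paper's model.

```python
# Toy contrast between repeated sampling and verifier-guided selection.
# `verifier_score` stands in for any learned verifier (assumption).
from collections import Counter
from typing import Callable, List

def repeated_sampling(answers: List[str]) -> str:
    """Pick the most frequent final answer among the N samples."""
    return Counter(answers).most_common(1)[0][0]

def verifier_guided(answers: List[str], verifier_score: Callable[[str], float]) -> str:
    """Pick the answer ranked highest by the verifier; this helps at small N
    but can be misled by verifier errors as N grows."""
    return max(answers, key=verifier_score)
```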
arXiv Detail & Related papers (2025-02-01T02:08:49Z) - Formal Verification of Deep Neural Networks for Object Detection [1.947473271879451]
Deep neural networks (DNNs) are widely used in real-world applications, yet they remain vulnerable to errors and adversarial attacks.
This work extends formal verification to the more complex domain of object detection models.
arXiv Detail & Related papers (2024-07-01T13:47:54Z) - A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains [33.46649770312231]
Prompting language models to provide step-by-step answers is a prominent approach for complex reasoning tasks.
No fine-grained step-level datasets are available to enable thorough evaluation of such verification methods.
We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning.
arXiv Detail & Related papers (2024-02-01T12:46:45Z) - A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z) - Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z) - On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets [101.28658250723804]
This paper experiments with augmenting a transformer model with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or manipulation action.
We observe that the proposed modules resulted in improved, and in fact state-of-the-art performance on an unseen validation set of a popular benchmark dataset, ALFRED.
We highlight this result as we believe it may be a wider phenomenon in machine learning tasks but primarily noticeable only in benchmarks that limit evaluations on test splits.
arXiv Detail & Related papers (2022-05-18T23:52:21Z) - Generating Fact Checking Explanations [52.879658637466605]
A crucial piece of the puzzle that is still missing is to understand how to automate the most elaborate part of the process.
This paper provides the first study of how these explanations can be generated automatically based on available claim context.
Our results indicate that optimising both objectives at the same time, rather than training them separately, improves the performance of a fact checking system.
arXiv Detail & Related papers (2020-04-13T05:23:25Z)