Related papers: Agentified Assessment of Logical Reasoning Agents

URL: http://arxiv.org/abs/2603.02788v1
Date: Tue, 03 Mar 2026 09:26:08 GMT
Title: Agentified Assessment of Logical Reasoning Agents
Authors: Zhiyu Ni, Yifeng Xiao, Zheng Liang
Abstract summary: Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).
Score: 3.5548629490839594
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The agent translates natural language premises and conclusions into executable Z3Py programs and employs satisfiability modulo theories (SMT) solving to determine logical entailment. On the cleaned FOLIO validation set, the auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).
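
The assessor loop described in the abstract (issue a task, enforce a budget, parse the output, record a structured failure type) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the agent is modeled as a plain callable rather than an agent-to-agent interface, and the names `assess`, `Failure`, `Record`, `toy_agent`, `DEMO_TASKS`, the failure categories, and the label set True/False/Unknown are all assumptions made for illustration.

```python
import concurrent.futures as cf
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Failure(Enum):
    """Hypothetical failure taxonomy; the paper's actual categories may differ."""
    TIMEOUT = "timeout"
    RUNTIME_ERROR = "runtime_error"
    PARSE_ERROR = "parse_error"


# Assumed label set, keyed by a normalized (lowercase) form of the agent output.
VALID_LABELS = {"true": "True", "false": "False", "unknown": "Unknown"}


@dataclass
class Record:
    task_id: str
    predicted: Optional[str]
    correct: bool
    failure: Optional[Failure]


def assess(agent: Callable[[str], str],
           tasks: List[Tuple[str, str, str]],
           budget_s: float) -> List[Record]:
    """Run each (task_id, prompt, gold_label) task under a wall-clock budget."""
    records = []
    for task_id, prompt, gold in tasks:
        pool = cf.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(agent, prompt)
        try:
            raw = future.result(timeout=budget_s)
        except cf.TimeoutError:
            records.append(Record(task_id, None, False, Failure.TIMEOUT))
            continue
        except Exception:
            records.append(Record(task_id, None, False, Failure.RUNTIME_ERROR))
            continue
        finally:
            # A thread cannot be killed; we only stop waiting for it.
            # A real assessor would more likely isolate the agent in a subprocess.
            pool.shutdown(wait=False)
        label = VALID_LABELS.get(str(raw).strip().lower())
        if label is None:
            records.append(Record(task_id, None, False, Failure.PARSE_ERROR))
        else:
            records.append(Record(task_id, label, label == gold, None))
    return records


def accuracy(records: List[Record]) -> float:
    """Fraction of tasks answered correctly; failures count as incorrect."""
    return sum(r.correct for r in records) / len(records) if records else 0.0


def toy_agent(prompt: str) -> str:
    """Stand-in agent that misbehaves on purpose, one failure mode per keyword."""
    if "slow" in prompt:
        time.sleep(0.5)
    if "crash" in prompt:
        raise ValueError("boom")
    if "garbled" in prompt:
        return "maybe?"
    return "True"


DEMO_TASKS = [
    ("t1", "ok", "True"),
    ("t2", "slow", "True"),
    ("t3", "crash", "False"),
    ("t4", "garbled", "Unknown"),
]
```

Calling `assess(toy_agent, DEMO_TASKS, budget_s=0.1)` yields one record per task, so every task contributes to the accuracy denominator even when the agent times out, crashes, or returns unparseable output. Counting failures as incorrect is a design choice of this sketch, not a claim about the paper's scoring.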

Related papers

Agentic Code Reasoning [6.246212222645163]
We introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. We evaluate three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy. These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution.
arXiv Detail & Related papers (2026-03-02T14:17:06Z)
The Necessity of a Unified Framework for LLM-Based Agent Evaluation [46.631678638677386]
General-purpose agents have seen fundamental advancements. Evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation.
arXiv Detail & Related papers (2026-02-03T08:18:37Z)
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization [73.0780809974414]
We propose a Reflective Autoformalization method that integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess their semantic fidelity, and self-correct identified errors. Experiments show that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines.
arXiv Detail & Related papers (2025-10-28T16:22:54Z)
Alita-G: Self-Evolving Generative Agent for Agent Generation [54.49365835457433]
We present ALITA-G, a framework that transforms a general-purpose agent into a domain expert. In this framework, a generalist agent executes a curated suite of target-domain tasks. It attains strong gains while reducing computation costs.
arXiv Detail & Related papers (2025-10-27T17:59:14Z)
What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment [3.5583478152586756]
Agent GPA is an evaluation paradigm based on an agent's operational loop of setting goals, devising plans, and executing actions. The framework includes five evaluation metrics: Goal Fulfillment, Logical Consistency, Execution Efficiency, Plan Quality, and Plan Adherence.
arXiv Detail & Related papers (2025-10-09T22:40:19Z)
Towards Verified Code Reasoning by LLMs [6.973151264926856]
We describe a method to automatically validate the answers provided by a code reasoning agent. The method consists of extracting a formal representation of the agent's response and, subsequently, using formal verification and program analysis tools.
arXiv Detail & Related papers (2025-09-30T17:17:51Z)
VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, each aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable-fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z)
VerifiAgent: a Unified Verification Agent in Language Model Reasoning [10.227089771963943]
We propose a unified verification agent that integrates two levels of verification: meta-verification and tool-based adaptive verification. VerifiAgent autonomously selects appropriate verification tools based on the reasoning type. It can be effectively applied to inference scaling, achieving better results with fewer generated samples and costs.
arXiv Detail & Related papers (2025-04-01T04:05:03Z)
Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.