SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems -- A Case Study on Ethereum Clients
- URL: http://arxiv.org/abs/2602.07513v2
- Date: Tue, 10 Feb 2026 07:04:48 GMT
- Title: SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems -- A Case Study on Ethereum Clients
- Authors: Masato Kamba, Akiyoshi Sannai
- Abstract summary: SPECA is a Specification-to-Checklist framework that turns normative requirements into checklists. We instantiate SPECA in an in-the-wild security audit contest for the Fusaka upgrade, covering 11 production clients. Our improved agent, evaluated against the ground truth of a competitive audit, achieved a strict recall of 27.3 percent on high-impact vulnerabilities.
- Score: 1.711666249985278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-implementation systems are increasingly audited against natural-language specifications. Differential testing scales well when implementations disagree, but it provides little signal when all implementations converge on the same incorrect interpretation of an ambiguous requirement. We present SPECA, a Specification-to-Checklist Auditing framework that turns normative requirements into checklists, maps them to implementation locations, and supports cross-implementation reuse. We instantiate SPECA in an in-the-wild security audit contest for the Ethereum Fusaka upgrade, covering 11 production clients. Across 54 submissions, 17 were judged valid by the contest organizers. Cross-implementation checks account for 76.5 percent (13 of 17) of valid findings, suggesting that checklist-derived one-to-many reuse is a practical scaling mechanism in multi-implementation audits. To understand false positives, we manually coded the 37 invalid submissions and find that threat model misalignment explains 56.8 percent (21 of 37): reports that rely on assumptions about trust boundaries or scope that contradict the audit's rules. We detected no High or Medium findings in the V1 deployment; misses concentrated in specification details and implicit assumptions (57.1 percent), timing and concurrency issues (28.6 percent), and external library dependencies (14.3 percent). Our improved agent, evaluated against the ground truth of a competitive audit, achieved a strict recall of 27.3 percent on high-impact vulnerabilities, placing it in the top 4 percent of human auditors and outperforming 49 of 51 contestants on critical issues. These results, though from a single deployment, suggest that early, explicit threat modeling is essential for reducing false positives and focusing agentic auditing effort. The agent-driven process enables expert validation and submission in about 40 minutes on average.
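The abstract's core mechanism, deriving a checklist item from a normative spec clause and reusing it one-to-many across client implementations, can be sketched minimally as follows. This is an illustrative assumption of how such a pipeline might be structured, not the paper's implementation; the class names, the MUST clause, the client list, and the `violates` predicate are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    """One auditable check derived from a normative spec requirement."""
    requirement: str   # the spec sentence (e.g. a MUST clause)
    check: str         # concrete condition an auditor can verify

@dataclass
class Audit:
    clients: list[str]                                  # implementation names
    findings: dict[str, list[str]] = field(default_factory=dict)

    def apply(self, item: ChecklistItem, violates) -> None:
        # One-to-many reuse: a single checklist item is evaluated
        # against every client implementation in turn.
        for client in self.clients:
            if violates(client):
                self.findings.setdefault(item.check, []).append(client)

# Toy usage with a hypothetical requirement and a stand-in predicate
# that pretends only 'besu' misses the bound check.
item = ChecklistItem(
    requirement="Clients MUST reject blobs exceeding MAX_BLOBS_PER_BLOCK.",
    check="blob-count bound enforced",
)
audit = Audit(clients=["geth", "nethermind", "besu"])
audit.apply(item, violates=lambda c: c == "besu")
print(audit.findings)  # {'blob-count bound enforced': ['besu']}
```

The point of the sketch is the scaling argument from the abstract: the cost of deriving a check from the specification is paid once, while the check itself runs against every client, which is why cross-implementation reuse accounted for most of the valid findings.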
Related papers
- ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices [17.39388308538324]
This paper introduces ProactiveMobile, a benchmark for proactive mobile agent development. It formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals. The best system achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%) in experiments.
arXiv Detail & Related papers (2026-02-25T12:32:37Z) - When Is Enough Not Enough? Illusory Completion in Search Agents [56.98225130959051]
We study whether search agents reliably reason across all requirements by tracking, verifying, and maintaining multiple conditions. We find that illusory completion frequently occurs, wherein agents believe tasks are complete despite unresolved or violated constraints, leading to underverified answers. We examine whether explicit constraint-state tracking during execution mitigates these failures via LiveLedger, an inference-time tracker.
arXiv Detail & Related papers (2026-02-07T13:50:38Z) - Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements [78.87065404966002]
Existing benchmarks predominantly curate questions at the question level. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
arXiv Detail & Related papers (2025-12-31T13:55:54Z) - DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks [10.977990951788422]
DrawingBench is a verification framework for evaluating the trustworthiness of agentic LLMs. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels. We evaluate four state-of-the-art LLMs across 1,000 tests.
arXiv Detail & Related papers (2025-12-01T01:18:21Z) - Multi-Agent Legal Verifier Systems for Data Transfer Planning [1.286589966480548]
Legal compliance in AI-driven data transfer planning is becoming increasingly critical under stringent privacy regulations. We propose a multi-agent legal verifier that decomposes compliance checking into specialized agents for statutory interpretation, business context evaluation, and risk assessment.
arXiv Detail & Related papers (2025-11-14T03:32:08Z) - One Signature, Multiple Payments: Demystifying and Detecting Signature Replay Vulnerabilities in Smart Contracts [56.94148977064169]
Smart contracts lacking checks on signature usage conditions can allow repeated verifications, increasing the risk of permission abuse and threatening contract assets. We define this issue as the Signature Replay Vulnerability (SRV). From 1,419 audit reports across 37 blockchain security companies, we identified 108 with detailed SRV descriptions and classified five types of SRVs.
arXiv Detail & Related papers (2025-11-12T09:17:13Z) - SLEAN: Simple Lightweight Ensemble Analysis Network for Multi-Provider LLM Coordination: Design, Implementation, and Vibe Coding Bug Investigation Case Study [0.0]
SLEAN operates as a simple prompt bridge between LLMs using .txt templates, requiring no deep technical knowledge for deployment. The three-phase protocol of independent analysis, cross-critique, and arbitration filters harmful AI-generated code suggestions. The file-driven, provider-agnostic architecture enables deployment without specialized coding expertise.
arXiv Detail & Related papers (2025-10-11T04:24:04Z) - FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning [62.452350134196934]
FaithCoT-Bench is a unified benchmark for instance-level CoT unfaithfulness detection. Our framework formulates unfaithfulness detection as a discriminative decision problem. FaithCoT-Bench sets a solid basis for future research toward more interpretable and trustworthy reasoning in LLMs.
arXiv Detail & Related papers (2025-10-05T05:16:54Z) - VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, each view aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable-fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z) - Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition [101.86739402748995]
We run the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. We build the Agent Red Teaming benchmark and evaluate it across 19 state-of-the-art models. Our findings highlight critical and persistent vulnerabilities in today's AI agents.
arXiv Detail & Related papers (2025-07-28T05:13:04Z) - Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams [2.897171041611256]
This study introduces CMExamSet, a benchmarking dataset comprising 689 authentic multiple-choice questions from four nationally accredited CM certification exams. Results indicate that GPT-4o and Claude 3.7 surpass typical human pass thresholds (70%), with average accuracies of 82% and 83%, respectively. Conceptual misunderstandings are the most common error type, underscoring the need for enhanced domain-specific reasoning models.
arXiv Detail & Related papers (2025-04-04T18:13:45Z) - Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing [1.4201040196058878]
Large Language Models (LLMs) have transformed task automation and content generation across various domains. We introduce a novel jailbreaking framework that employs distributed prompt processing combined with iterative refinements to bypass safety measures. Tested on 500 malicious prompts across 10 cybersecurity categories, the framework achieves a 73.2% Success Rate (SR) in generating malicious code.
arXiv Detail & Related papers (2025-03-27T15:19:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.