An end-to-end agentic pipeline for smart contract translation and quality evaluation
- URL: http://arxiv.org/abs/2602.13808v1
- Date: Sat, 14 Feb 2026 14:37:59 GMT
- Title: An end-to-end agentic pipeline for smart contract translation and quality evaluation
- Authors: Abhinav Goel, Chaitya Shah, Agostino Capponi, Alfio Gliozzo
- Abstract summary: We present an end-to-end framework for systematic evaluation of smart contracts generated from natural-language specifications. The system parses contractual text into structured schemas, generates Solidity code, and performs automated quality assessment through compilation and security checks.
- Score: 5.027278762864141
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an end-to-end framework for systematic evaluation of LLM-generated smart contracts from natural-language specifications. The system parses contractual text into structured schemas, generates Solidity code, and performs automated quality assessment through compilation and security checks. Using CrewAI-style agent teams with iterative refinement, the pipeline produces structured artifacts with full provenance metadata. Quality is measured across five dimensions (functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality), which are aggregated into composite scores. The framework supports paired evaluation against ground-truth implementations, quantifying alignment and identifying systematic error modes such as logic omissions and state transition inconsistencies. This provides a reproducible benchmark for empirical research on smart contract synthesis quality and supports extensions to formal verification and compliance checking.
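The abstract names five quality dimensions and says they are aggregated into composite scores, but it does not give the aggregation rule. The sketch below is one plausible reading, not the authors' method: it assumes each dimension is scored on a 0 to 1 scale and combines them with a weighted mean (equal weights by default). The function names `composite_score` and `paired_gap`, the score scale, and the weights are all hypothetical.

```python
# Dimension names follow the abstract; everything else here is assumed.
DIMENSIONS = (
    "functional_completeness",
    "variable_fidelity",
    "state_machine_correctness",
    "business_logic_fidelity",
    "code_quality",
)


def composite_score(scores, weights=None):
    """Weighted mean of the five dimension scores (hypothetical aggregation)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    if weights is None:
        # Assumption: equal weighting when no weights are supplied.
        weights = {d: 1.0 for d in DIMENSIONS}
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


def paired_gap(generated, reference):
    """Per-dimension shortfall of a generated contract versus a reference."""
    return {d: reference[d] - generated[d] for d in DIMENSIONS}


# Illustrative scores only; these are not results from the paper.
example_generated = {
    "functional_completeness": 0.5,
    "variable_fidelity": 1.0,
    "state_machine_correctness": 0.5,
    "business_logic_fidelity": 0.5,
    "code_quality": 1.0,
}
example_reference = {d: 1.0 for d in DIMENSIONS}

if __name__ == "__main__":
    print(composite_score(example_generated))
    print(paired_gap(example_generated, example_reference))
```

A per-dimension gap like the one returned by `paired_gap` is one simple way to surface the kind of systematic error modes the abstract mentions, since a consistently large shortfall in one dimension points at a recurring weakness rather than noise.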
Related papers
- DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write significantly outperforms single-turn generation.
arXiv Detail & Related papers (2026-01-07T03:07:52Z)
- Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Existing formalizers still struggle to consistently generate valid statements that meet syntactic validity and semantic consistency. We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z)
- AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment [69.06977852423564]
Image quality assessment (IQA) reflects both the quantification and interpretation of perceptual quality rooted in the human visual system. AgenticIQA decomposes IQA into four subtasks: distortion detection, distortion analysis, tool selection, and tool execution. To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents.
arXiv Detail & Related papers (2025-09-30T09:37:01Z)
- Transparent, Evaluable, and Accessible Data Agents: A Proof-of-Concept Framework [0.0]
This article presents a modular, component-based architecture for developing and evaluating AI agents. The system addresses core challenges in data accessibility by enabling non-technical users to interact with complex data warehouses. A cornerstone of the design is its commitment to transparent decision-making, achieved through a multi-layered reasoning framework.
arXiv Detail & Related papers (2025-09-28T23:54:41Z)
- CRACQ: A Multi-Dimensional Approach To Automated Document Assessment [0.0]
CRACQ is a multi-dimensional evaluation framework tailored to evaluate documents across five specific traits: Coherence, Rigor, Appropriateness, Completeness, and Quality. It integrates linguistic, semantic, and structural signals into a cumulative assessment, enabling both holistic and trait-level analysis.
arXiv Detail & Related papers (2025-09-26T17:01:54Z)
- AssertCoder: LLM-Based Assertion Generation via Multimodal Specification Extraction [32.14733357890831]
We propose AssertCoder, a novel unified framework that automatically generates high-quality SVAs. AssertCoder employs modality-sensitive preprocessing to parse heterogeneous specification formats. The framework incorporates a mutation-based evaluation approach to assess assertion quality.
arXiv Detail & Related papers (2025-07-14T14:43:14Z)
- AI4Contracts: LLM & RAG-Powered Encoding of Financial Derivative Contracts [1.3060230641655135]
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) are reshaping how AI systems extract and organize information from unstructured text. We introduce CDMizer, a template-driven, LLM- and RAG-based framework for structured text transformation.
arXiv Detail & Related papers (2025-06-01T16:05:00Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- Q-NL Verifier: Leveraging Synthetic Data for Robust Knowledge Graph Question Answering [0.4499833362998489]
We present Q-NL Verifier, an approach to generating high-quality synthetic pairs of queries and NL translations. Our approach relies on large language models to generate semantically precise natural language paraphrases of structured queries. Our experiments with the well-known LC-QuAD 2.0 benchmark show that Q-NL Verifier generalizes well to paraphrases from other models and even human-authored translations.
arXiv Detail & Related papers (2025-03-03T10:28:24Z)
- Localizing Factual Inconsistencies in Attributable Text Generation [74.11403803488643]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation. We show that QASemConsistency yields factual consistency scores that correlate well with human judgments.
arXiv Detail & Related papers (2024-10-09T22:53:48Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-text generation. It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.