From Evaluation to Enhancement: Large Language Models for Zero-Knowledge Proof Code Generation
- URL: http://arxiv.org/abs/2509.11708v1
- Date: Mon, 15 Sep 2025 09:07:52 GMT
- Title: From Evaluation to Enhancement: Large Language Models for Zero-Knowledge Proof Code Generation
- Authors: Zhantong Xue, Pingchuan Ma, Zhaoyu Wang, Shuai Wang,
- Abstract summary: Zero-knowledge proofs (ZKPs) are increasingly deployed in domains such as privacy-preserving authentication, blockchain scalability, and secure finance.<n>Unlike mainstream programming, ZK development requires reasoning about finite field arithmetic, constraint systems, and gadgets.<n>We introduce textscZK-Coder, an agentic framework that augments large language models with constraint sketching, guided retrieval, and interactive repair.
- Score: 8.358179599532592
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Zero-knowledge proofs (ZKPs) are increasingly deployed in domains such as privacy-preserving authentication, blockchain scalability, and secure finance. However, authoring ZK programs remains challenging: unlike mainstream programming, ZK development requires reasoning about finite field arithmetic, constraint systems, and gadgets, making it knowledge-intensive and error-prone. While large language models (LLMs) have demonstrated strong code generation capabilities in general-purpose languages, their effectiveness for ZK programming, where correctness hinges on both language mastery and gadget-level reasoning, remains unexplored. To address this gap, we propose \textsc{ZK-Eval}, a domain-specific evaluation pipeline that probes LLM capabilities at three levels: language knowledge, gadget competence, and end-to-end program generation. Our evaluation of four state-of-the-art LLMs reveals that models excel at surface-level syntax but struggle with gadget usage and semantic correctness, often yielding incorrect programs. Based on these insights, we introduce \textsc{ZK-Coder}, an agentic framework that augments LLMs with constraint sketching, guided retrieval, and interactive repair. Experiments on Circom and Noir show substantial gains, with success rates improving from 17.35\% to 83.38\% and from 32.21\% to 90.05\%, respectively. With \textsc{ZK-Eval} and \textsc{ZK-Coder}, we establish a foundation for systematically measuring and augmenting LLMs in ZK code generation to lower barriers for practitioners and advance trustworthy computation.
Related papers
- Perish or Flourish? A Holistic Evaluation of Large Language Models for Code Generation in Functional Programming [3.2230833657560503]
We introduce FPEval, a new benchmark of 721 programming tasks across three difficulty levels on three mainstream programming languages: Haskell, Ocaml and Scala.<n>Using this framework, we evaluate state-of-the-art Large Language Models (LLMs) for code generation in functional programming languages and Java.
arXiv Detail & Related papers (2026-01-05T12:33:37Z) - CodeSimpleQA: Scaling Factuality in Code Large Language Models [55.705748501461294]
We present CodeSimpleQA, a comprehensive benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions.<n>We also create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-12-22T14:27:17Z) - TASE: Token Awareness and Structured Evaluation for Multilingual Language Models [8.058965963418785]
TASE is a benchmark designed to evaluate large language models' ability to perceive and reason about token-level information.<n> TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean.<n>We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1.
arXiv Detail & Related papers (2025-08-07T15:11:17Z) - MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating code for the latest code generation LLMs in Russian.<n>This benchmark includes 11 evaluation tasks that span 8 programming languages.<n>We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z) - Bridging LLM-Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights [0.0]
This paper introduces a novel scoring mechanism called the SBC score.<n>It is based on a reverse generation technique that leverages the natural language generation capabilities of Large Language Models.<n>Unlike direct code analysis, our approach reconstructs system requirements from AI-generated code and compares them with the original specifications.
arXiv Detail & Related papers (2025-02-11T01:12:11Z) - Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages [21.18996339478024]
We introduce emphsynthetic programming elicitation and compilation (SPEAC)
SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness.
We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language.
arXiv Detail & Related papers (2024-06-05T22:16:19Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to better interpret the programming domain knowledge.<n>CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z) - A Knowledge-Injected Curriculum Pretraining Framework for Question Answering [70.13026036388794]
We propose a general Knowledge-Injected Curriculum Pretraining framework (KICP) to achieve comprehensive KG learning and exploitation for Knowledge-based question answering tasks.
The KI module first injects knowledge into the LM by generating KG-centered pretraining corpus, and generalizes the process into three key steps.
The KA module learns knowledge from the generated corpus with LM equipped with an adapter as well as keeps its original natural language understanding ability.
The CR module follows human reasoning patterns to construct three corpora with increasing difficulties of reasoning, and further trains the LM from easy to hard in a curriculum manner.
arXiv Detail & Related papers (2024-03-11T03:42:03Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Evaluating, Understanding, and Improving Constrained Text Generation for Large Language Models [49.74036826946397]
This study investigates constrained text generation for large language models (LLMs)
Our research mainly focuses on mainstream open-source LLMs, categorizing constraints into lexical, structural, and relation-based types.
Results illuminate LLMs' capacity and deficiency to incorporate constraints and provide insights for future developments in constrained text generation.
arXiv Detail & Related papers (2023-10-25T03:58:49Z) - ICE-Score: Instructing Large Language Models to Evaluate Code [7.556444391696562]
We propose textttICE-Score, a new evaluation metric via instructing large language models for code assessments.
Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences.
Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation.
arXiv Detail & Related papers (2023-04-27T16:38:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.