Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification
- URL: http://arxiv.org/abs/2509.23061v1
- Date: Sat, 27 Sep 2025 02:33:08 GMT
- Title: Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification
- Authors: Xu Xu, Xin Li, Xingwei Qu, Jie Fu, Binhang Yuan,
- Abstract summary: DafnyCOMP is a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny.<n>We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks.
- Score: 21.987735608080374
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.
Related papers
- Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities [2.9203730377983654]
Existing benchmarks fail to reflect real-world use or isolate compliance from task success.<n>We introduce MOSAIC, a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints.<n>We show that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position.
arXiv Detail & Related papers (2026-01-26T15:02:15Z) - Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models [4.841487377596519]
This paper investigates whether Large Language Models' code-comprehension performance aligns with traditional human-centric software metrics.<n>We introduce a diagnostic framework that reframes code understanding as a binary input-output consistency task.
arXiv Detail & Related papers (2026-01-19T10:58:24Z) - Matching-Based Few-Shot Semantic Segmentation Models Are Interpretable by Design [8.993770750003673]
Few-Shot Semantic (FSS) models achieve strong performance in segmenting novel classes with minimal labeled examples.<n>This paper introduces the first dedicated method for interpreting matching-based FSS models.<n>Our Affinity Explainer approach extracts attribution maps that highlight which pixels in support images contribute most to query segmentation predictions.
arXiv Detail & Related papers (2025-11-22T19:22:10Z) - Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements.<n>Existing formalizer still struggles to consistently generate valid statements that meet syntactic validity and semantic consistency.<n>We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z) - Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling.<n>It comprises four distinct small-scale agents, with clearly defined roles and effective collaboration.<n>It shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z) - CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks [12.465309397733249]
Large language models (LLMs) have been widely adopted across diverse software engineering domains.<n>These applications require understanding beyond surface-level code patterns.<n>Existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated.
arXiv Detail & Related papers (2025-07-03T01:35:58Z) - A Large Language Model-Empowered Agent for Reliable and Robust Structural Analysis [14.754785659805869]
Large language models (LLMs) have exhibited remarkable capabilities across diverse open-domain tasks, yet their application in specialized domains such as civil engineering remains largely unexplored.<n>This paper starts bridging this gap by evaluating and enhancing the reliability and robustness of LLMs in structural analysis of beams.<n> Experimental results demonstrate that the agent achieves accuracy exceeding 99.0% on the benchmark dataset, exhibiting reliable and robust performance across diverse conditions.
arXiv Detail & Related papers (2025-06-27T04:16:53Z) - A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces $textbfLongBioBench, a benchmark for evaluating long-context language models.<n>We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results.<n>Our further analysis indicates some design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z) - LLMDFA: Analyzing Dataflow in Code with Large Language Models [8.92611389987991]
This paper presents LLMDFA, a compilation-free and customizable dataflow analysis framework.
We decompose the problem into several subtasks and introduce a series of novel strategies.
On average, LLMDFA achieves 87.10% precision and 80.77% recall, surpassing existing techniques with F1 score improvements of up to 0.35.
arXiv Detail & Related papers (2024-02-16T15:21:35Z) - FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z) - Discover, Explanation, Improvement: An Automatic Slice Detection
Framework for Natural Language Processing [72.14557106085284]
slice detection models (SDM) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.