SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation
- URL: http://arxiv.org/abs/2602.16671v1
- Date: Wed, 18 Feb 2026 18:09:03 GMT
- Title: SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation
- Authors: Jaid Monwar Chowdhury, Chi-An Fu, Reyhaneh Jabbarvand
- Abstract summary: We introduce a neuro-symbolic, scenario-based framework that bridges the gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. We evaluate it on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matching or exceeding the symbolic execution tool KLEE.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated unit test generation for C remains a formidable challenge due to the semantic gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. While Large Language Models (LLMs) exhibit strong generative capabilities, direct intent-to-code synthesis frequently suffers from the leap-to-code failure mode, where models prematurely emit code without grounding in program structure, constraints, and semantics. This results in non-compilable tests, hallucinated function signatures, low branch coverage, and semantically irrelevant assertions that cannot properly capture bugs. We introduce SPARC, a neuro-symbolic, scenario-based framework that bridges this gap through four stages: (1) Control Flow Graph (CFG) analysis, (2) an Operation Map that grounds LLM reasoning in validated utility helpers, (3) path-targeted test synthesis, and (4) an iterative, self-correcting validation loop using compiler and runtime feedback. We evaluate SPARC on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matching or exceeding the symbolic execution tool KLEE on complex subjects. SPARC retains 94.3% of tests through iterative repair and produces code with significantly higher developer-rated readability and maintainability. By aligning LLM reasoning with program structure, SPARC provides a scalable path for industrial-grade testing of legacy C codebases.
Related papers
- CLARC: C/C++ Benchmark for Robust Code Search [2.225731679677886]
We present CLARC, a C/C++ benchmark built from real-world GitHub repositories. CLARC contains 1,245 query-code pairs for evaluation and 5,472 pairs for training.
arXiv Detail & Related papers (2026-03-04T18:57:37Z) - AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z) - RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z) - BRIDGE: Building Representations In Domain Guided Program Verification [67.36686119518441]
BRIDGE decomposes verification into three interconnected domains: Code, Specifications, and Proofs. We show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods.
arXiv Detail & Related papers (2025-11-26T06:39:19Z) - Can LLMs Recover Program Semantics? A Systematic Evaluation with Symbolic Execution [1.5377279217726239]
Obfuscation poses a persistent challenge for software engineering tasks such as program comprehension, maintenance, testing, and vulnerability detection. We investigate whether fine-tuned language models can effectively deobfuscate programs and restore analyzability.
arXiv Detail & Related papers (2025-11-24T13:55:20Z) - Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents [0.0]
Existing frameworks often mix cognition, memory, and control in a single prompt, reducing coherence and predictability. The Structured Cognitive Loop (SCL) is proposed as an alternative architecture that separates these functions. SCL achieves an average task success rate of 86.3 percent, compared with 70.5 to 76.8 percent for baselines.
arXiv Detail & Related papers (2025-09-23T17:43:17Z) - Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: the System Shell Layer, the Prompt Orchestration Layer, and the LLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z) - CHORUS: Zero-shot Hierarchical Retrieval and Orchestration for Generating Linear Programming Code [0.0]
This study explores the efficiency of Large Language Models (LLMs) in generating solver-specific Linear Programming (LP) code. We propose CHORUS, a retrieval-augmented generation framework for synthesizing Gurobi-based LP code from natural language problem statements. Experiments on the NL4-Code benchmark show that CHORUS improves the performance of open-source LLMs by a significant margin compared to baseline and conventional RAG.
arXiv Detail & Related papers (2025-05-02T16:36:57Z) - CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis [6.8081984950459]
Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect. We propose CodeARC, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions.
arXiv Detail & Related papers (2025-03-29T16:50:39Z) - CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.