SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation
- URL: http://arxiv.org/abs/2602.16671v1
- Date: Wed, 18 Feb 2026 18:09:03 GMT
- Title: SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation
- Authors: Jaid Monwar Chowdhury, Chi-An Fu, Reyhaneh Jabbarvand
- Abstract summary: We introduce a neuro-symbolic, scenario-based framework that bridges the gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. We evaluate it on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matching or exceeding the symbolic execution tool KLEE.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated unit test generation for C remains a formidable challenge due to the semantic gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. While Large Language Models (LLMs) exhibit strong generative capabilities, direct intent-to-code synthesis frequently suffers from the leap-to-code failure mode, where models prematurely emit code without grounding in program structure, constraints, and semantics. This results in non-compilable tests, hallucinated function signatures, low branch coverage, and semantically irrelevant assertions that cannot properly capture bugs. We introduce SPARC, a neuro-symbolic, scenario-based framework that bridges this gap through four stages: (1) Control Flow Graph (CFG) analysis, (2) an Operation Map that grounds LLM reasoning in validated utility helpers, (3) path-targeted test synthesis, and (4) an iterative, self-correcting validation loop using compiler and runtime feedback. We evaluate SPARC on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matching or exceeding the symbolic execution tool KLEE on complex subjects. SPARC retains 94.3% of tests through iterative repair and produces code with significantly higher developer-rated readability and maintainability. By aligning LLM reasoning with program structure, SPARC provides a scalable path for industrial-grade testing of legacy C codebases.
Related papers
- CLARC: C/C++ Benchmark for Robust Code Search [2.225731679677886]
We present CLARC, a C/C++ benchmark built from real-world GitHub repositories. CLARC contains 1,245 query-code pairs for evaluation and 5,472 pairs for training.
arXiv Detail & Related papers (2026-03-04T18:57:37Z) - AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z) - RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z) - BRIDGE: Building Representations In Domain Guided Program Verification [67.36686119518441]
BRIDGE decomposes verification into three interconnected domains: Code, Specifications, and Proofs. We show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods.
arXiv Detail & Related papers (2025-11-26T06:39:19Z) - Can LLMs Recover Program Semantics? A Systematic Evaluation with Symbolic Execution [1.5377279217726239]
Obfuscation poses a persistent challenge for software engineering tasks such as program comprehension, maintenance, testing, and vulnerability detection. We investigate whether fine-tuned language models can effectively deobfuscate programs and restore analyzability.
arXiv Detail & Related papers (2025-11-24T13:55:20Z) - Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents [0.0]
Existing frameworks often mix cognition, memory, and control in a single prompt, reducing coherence and predictability. The Structured Cognitive Loop (SCL) is proposed as an alternative architecture that separates these functions. SCL achieves an average task success rate of 86.3 percent, compared with 70.5 to 76.8 percent for baselines.
arXiv Detail & Related papers (2025-09-23T17:43:17Z) - Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: the System Shell Layer, the Prompt Orchestration Layer, and the LLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z) - CHORUS: Zero-shot Hierarchical Retrieval and Orchestration for Generating Linear Programming Code [0.0]
This study explores the efficiency of Large Language Models (LLMs) in generating solver-specific Linear Programming (LP) code. We propose CHORUS, a retrieval-augmented generation framework for synthesizing Gurobi-based LP code from natural language problem statements. Experiments on the NL4-Code benchmark show that CHORUS improves the performance of open-source LLMs by a significant margin compared to baseline and conventional RAG.
arXiv Detail & Related papers (2025-05-02T16:36:57Z) - CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis [6.8081984950459]
Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect. We propose CodeARC, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions.
arXiv Detail & Related papers (2025-03-29T16:50:39Z) - CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.