ProgramBench: Can Language Models Rebuild Programs From Scratch?
Abstract Overview
ProgramBench introduces a benchmark for evaluating whether software-engineering agents can reconstruct full software projects from scratch when given only a compiled executable and its usage documentation. The benchmark comprises 200 open-source repositories spanning small CLI tools to large systems such as FFmpeg, SQLite, and PHP, and evaluates candidate solutions with hidden end-to-end behavioral tests generated through agent-driven fuzzing. This design measures holistic software development capability, including architecture and implementation choices, without constraining models to the original code structure or language. Across nine language models, the benchmark proved very difficult: no model fully solved any task, although the best model (Claude Opus 4.7) achieved ≥95% test pass rates on 3% of task instances.
Novelty
The paper's main novelty is an implementation-agnostic benchmark that evaluates behavioral reconstruction of complete programs rather than bug fixing, feature completion, or filling in predefined code skeletons. It also introduces a scalable task-construction and evaluation pipeline in which repositories are reduced to an executable plus documentation, and hidden tests are created via agent-driven fuzzing of observable behavior, requiring no existing test suite or language-specific tooling.
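As a rough illustration of what an implementation-agnostic behavioral check can look like, the sketch below runs a reference program and a candidate program on the same input and compares only what a user could observe. It is not the paper's harness: the names `Observation`, `observe`, and `behaves_the_same` are hypothetical, and the choice to compare exit code and stdout only is an assumption made for brevity.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible behavior of one invocation (assumed: exit code + stdout)."""
    exit_code: int
    stdout: bytes


def observe(command, args, stdin=b"", timeout=10.0):
    """Run `command` with `args`, feeding `stdin`, and record the observable result."""
    proc = subprocess.run(
        list(command) + list(args),
        input=stdin,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return Observation(proc.returncode, proc.stdout)


def behaves_the_same(reference, candidate, args, stdin=b""):
    """True if both programs produce the same observation for this input."""
    return observe(reference, args, stdin) == observe(candidate, args, stdin)


# Stand-in "programs": two Python one-liners that agree on some inputs only.
reference = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
candidate = [sys.executable, "-c", "import sys; print(sys.argv[1].swapcase())"]

same_on_lowercase = behaves_the_same(reference, candidate, ["abc"])
same_on_mixed_case = behaves_the_same(reference, candidate, ["Abc"])
```

Because the check only looks at process-level behavior, the candidate can be written in any language and organized in any way, which is the property the benchmark relies on.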
Results
None of the nine evaluated models fully resolved any of the 200 tasks; the strongest model (Claude Opus 4.7) passed at least 95% of tests on only 3% of task instances. The generated behavioral test suites achieved line coverage at least comparable to developer-written suites (averaging 79.7% vs. 56.8% for native suites), supporting their usefulness as an evaluation signal. Analysis shows that model-produced codebases are significantly shorter (median 1,173 vs. 3,068 lines), use fewer files (median 3 vs. 15), and contain fewer but longer functions than the original human-written implementations.
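The two headline figures (tasks fully resolved, and tasks with at least 95% of tests passing) can be derived from per-task pass counts. The following is a minimal sketch under assumed inputs: the `summarize` function, the `(passed, total)` tuple layout, and the sample numbers are all hypothetical and are not data from the paper.

```python
def summarize(results, threshold=0.95):
    """Aggregate per-task (passed, total) counts into two fractions of tasks.

    `results` maps a task name to a (passed, total) pair of test counts.
    """
    if not results:
        raise ValueError("no task results given")
    fully_resolved = 0
    near_resolved = 0
    for passed, total in results.values():
        if total <= 0:
            raise ValueError("each task needs at least one test")
        if passed == total:
            fully_resolved += 1  # every hidden test passed
        if passed / total >= threshold:
            near_resolved += 1  # at least `threshold` of hidden tests passed
    n = len(results)
    return {
        "fully_resolved": fully_resolved / n,
        "at_least_threshold": near_resolved / n,
    }


# Made-up example: four tasks with varying pass counts.
example = {
    "task_a": (100, 100),
    "task_b": (96, 100),
    "task_c": (50, 100),
    "task_d": (0, 10),
}
summary = summarize(example)
```

Note that a fully resolved task also counts toward the threshold bucket, so the second fraction is always at least as large as the first.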
Key Points
- ProgramBench measures end-to-end program reconstruction from an executable and documentation, emphasizing architectural and design decisions rather than localized code edits, with models free to use any programming language.
- The benchmark contains 200 diverse tasks (totaling 248,853 test functions) and uses hidden behavioral tests generated by agent-driven fuzzing to evaluate functional equivalence without prescribing implementation structure.
- Current language models make partial progress but do not fully solve any task, and their solutions diverge from human code organization by favoring fewer files (median 3 vs. 15), shallower directory structures, and longer functions.
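The structural comparison in the last point (file counts, line counts, directory depth) could be measured along the following lines. This is only a sketch: `tree_stats` is a hypothetical helper, the suffix list is an arbitrary assumption about what counts as source code, and the paper's actual measurement procedure may differ.

```python
import tempfile
from pathlib import Path


def tree_stats(root, suffixes=(".c", ".py", ".rs", ".go")):
    """Count source files, total lines, and maximum directory depth under `root`."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file() and p.suffix in suffixes]
    lines = 0
    max_depth = 0
    for path in files:
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            lines += sum(1 for _ in handle)
        # Depth 0 means the file sits directly in `root`.
        depth = len(path.relative_to(root).parts) - 1
        max_depth = max(max_depth, depth)
    return {"files": len(files), "lines": lines, "max_depth": max_depth}


# Build a tiny throwaway tree to show the output shape.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "src" / "util").mkdir(parents=True)
    (base / "main.c").write_text("int main(void) {\n  return 0;\n}\n")
    (base / "src" / "util" / "helpers.c").write_text("int one(void) {\n  return 1; }\n")
    (base / "README.md").write_text("not counted\n")
    stats = tree_stats(base)
```

Taking the median of each field across many codebases would give figures of the same kind as the 3 vs. 15 file comparison quoted above.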