FuguReport

ProgramBench: Can Language Models Rebuild Programs From Scratch?

Authors John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press
Affiliations Stanford University / Meta / Harvard University
Categories Evaluation / Program Synthesis Evaluation / Measuring agent development ability, Method / Fuzzy Testing / Agent-guided testing for program correctness, Application / Software Engineering / Holistic software development by agents
License CC BY 4.0

Abstract Overview

ProgramBench is a benchmark for evaluating whether software-engineering agents can reconstruct full software projects from scratch when given only a compiled executable and its usage documentation. The benchmark comprises 200 open-source repositories, ranging from small CLI tools to large systems such as FFmpeg, SQLite, and PHP, and evaluates candidate solutions with hidden end-to-end behavioral tests generated through agent-driven fuzzing. This design measures holistic software development capability, including architecture and implementation choices, without constraining models to the original code structure or language. Across nine language models, the benchmark proved very difficult: no model fully solved any task, and the best model (Claude Opus 4.7) passed at least 95% of tests on only 3% of task instances.

Novelty

The paper's main novelty is an implementation-agnostic benchmark that evaluates behavioral reconstruction of complete programs rather than bug fixing, feature completion, or filling in predefined code skeletons. It also introduces a scalable task-construction and evaluation pipeline in which repositories are reduced to an executable plus documentation, and hidden tests are created via agent-driven fuzzing of observable behavior, requiring no existing test suite or language-specific tooling.
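The idea of testing only observable behavior can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the names `Observation`, `observe`, and `behaves_the_same` are hypothetical, and the sketch assumes that "behavior" means exit code plus standard output for a given argument list and standard input.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What a black-box test can see from one run (an assumed definition)."""
    exit_code: int
    stdout: bytes


def observe(command, args, stdin=b"", timeout=10.0):
    """Run a program with the given arguments and record its visible behavior."""
    proc = subprocess.run(
        [*command, *args],
        input=stdin,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # stderr is ignored in this sketch
        timeout=timeout,
    )
    return Observation(proc.returncode, proc.stdout)


def behaves_the_same(reference, candidate, args, stdin=b""):
    """Compare a candidate against the reference on a single test input.

    Nothing here inspects source code, so the candidate may be written in
    any language and organized in any way.
    """
    return observe(reference, args, stdin) == observe(candidate, args, stdin)
```

As a usage example, `behaves_the_same([sys.executable, "-c", ref_code], [sys.executable, "-c", cand_code], ["abc"])` compares two small inline Python programs; in a real setting `reference` would point at the original executable and `candidate` at the rebuilt one.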

Results

None of the nine evaluated models fully resolved any of the 200 tasks; the strongest model (Claude Opus 4.7) passed at least 95% of tests on only 3% of task instances. The generated behavioral test suites achieved higher average line coverage than the repositories' native, developer-written suites (79.7% vs. 56.8%), supporting their usefulness as an evaluation signal. Analysis shows that model-produced codebases are substantially shorter (median 1,173 vs. 3,068 lines), use fewer files (median 3 vs. 15), and contain fewer but longer functions than the original human-written implementations.
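The two headline metrics above (tasks fully resolved, and tasks with at least 95% of tests passed) can be sketched as a small aggregation over per-task results. The `TaskResult` type and `summarize` function are hypothetical names introduced for illustration, and the sketch assumes each task reports a simple passed/total test count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hidden-test outcome for one task (hypothetical record layout)."""
    task: str
    passed: int
    total: int

    @property
    def pass_rate(self):
        # A task with no tests is treated as unsolved in this sketch.
        return self.passed / self.total if self.total else 0.0


def summarize(results, threshold=0.95):
    """Return the fraction of tasks fully resolved and at or above the threshold."""
    if not results:
        raise ValueError("no task results to summarize")
    n = len(results)
    fully = sum(1 for r in results if r.total > 0 and r.passed == r.total)
    near = sum(1 for r in results if r.pass_rate >= threshold)
    return {"fully_resolved": fully / n, "at_or_above_threshold": near / n}
```

Reporting both numbers separates "nearly equivalent" reconstructions from exact ones, which is why a model can reach the 95% threshold on some tasks while fully resolving none.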

Key Points

  1. ProgramBench measures end-to-end program reconstruction from an executable and documentation, emphasizing architectural and design decisions rather than localized code edits, with models free to use any programming language.
  2. The benchmark contains 200 diverse tasks (totaling 248,853 test functions) and uses hidden behavioral tests generated by agent-driven fuzzing to evaluate functional equivalence without prescribing implementation structure.
  3. Current language models make partial progress but do not fully solve any task, and their solutions diverge from human code organization by favoring fewer files (median 3 vs. 15), shallower directory structures, and longer functions.
