Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft
Abstract Overview
This paper introduces SciCrafter, a Minecraft-based benchmark designed to evaluate whether AI agents can complete a discovery-to-application loop rather than only solve isolated tasks. The benchmark uses parameterized redstone circuit construction problems across five task families. In each, agents must discover mechanics and then apply that knowledge to build functional systems under increasing difficulty, with complexity scaling tied to discrete mechanism thresholds. The authors evaluate several frontier models (GPT-5.2, Gemini-3-Pro, Claude-Opus-4.5) and open models (Qwen3-32B, Qwen2.5-72B) under a common general-purpose code-agent scaffold (Claude Code) and find that baseline performance plateaus at approximately 26% success. They further decompose failures into four capacity gaps (knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application), using targeted interventions whose marginal contributions serve as proxies for each gap.
Novelty
The main novelty is framing Minecraft redstone tasks as a controlled, scalable testbed for the discovery-to-application loop, with difficulty increases tied to discrete mechanism thresholds (e.g., signal attenuation, repeater semantics) rather than simple surface variation. The paper also proposes a diagnostic decomposition of agent failures into four capacity gaps and studies targeted interventions including oracle hints, a scientist sub-agent with a structured experimental template, and a "Claim-Proof-Constraints-Example" knowledge consolidation format.
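To make the idea of a discrete mechanism threshold concrete, here is a minimal sketch of signal attenuation in vanilla Minecraft, where a redstone signal starts at strength 15 and drops by 1 for each block of dust it travels. The function names are hypothetical, and whether SciCrafter's difficulty levels are parameterized exactly this way is an assumption, not something the paper summary states.

```python
# Simplified model of vanilla redstone dust attenuation (straight run, no repeaters).
# Names here are illustrative and are not taken from the SciCrafter benchmark.

MAX_SIGNAL = 15  # strength on the first dust block next to a power source


def dust_strength(k: int) -> int:
    """Signal strength on the k-th dust block from the source (1-indexed)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    # Strength falls by 1 per block and never goes below 0.
    return max(0, MAX_SIGNAL - (k - 1))


def needs_repeater(run_length: int) -> bool:
    """True if the last dust block of a run of this length carries no signal."""
    return dust_strength(run_length) == 0


if __name__ == "__main__":
    for length in (1, 15, 16):
        print(length, dust_strength(length), needs_repeater(length))
```

Under this model a run of 15 dust blocks still works and a run of 16 does not, so a task whose required wire length crosses that boundary cannot be solved by scaling up a shorter solution; the agent has to discover that a repeater is needed.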
Results
Across 25 tasks (5 families × 5 levels), the best baseline model (Gemini-3-Pro) achieves only 26.0% success. Oracle hints roughly double success rates (absolute gains of 15.0–27.0 percentage points), and adding the scientist sub-agent brings Gemini-3-Pro to 64.0%. A residual application gap of 36.0–57.0% remains across models. On Gemini-3-Pro, the structured "Claim-Proof-Constraints-Example" consolidation format (64.0%) outperforms free-form summaries (58.0%) and the "Finding-Explanation-Example" format (60.5%). The analysis also indicates that knowledge application remains the largest overall gap, while knowledge gap identification is becoming a comparably important bottleneck for frontier models.
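The arithmetic behind the gap decomposition can be sketched as follows: each intervention's marginal gain is the difference in success rate from the previous stage, and the residual gap is whatever remains below 100% after the last stage. The helper names are hypothetical, and the only numbers used are the two Gemini-3-Pro figures reported above (26.0% baseline, 64.0% with hints and the scientist sub-agent); per-stage values for intermediate interventions are not given here and are not filled in.

```python
# Sketch of marginal-contribution bookkeeping for staged interventions.
# Function names are illustrative; this is not the authors' analysis code.


def marginal_gains(stages):
    """stages: list of (label, success_percent), in the order interventions are added.

    Returns a list of (label, gain_in_percentage_points) for every stage after the first.
    """
    gains = []
    for (_, prev), (label, cur) in zip(stages, stages[1:]):
        gains.append((label, round(cur - prev, 1)))
    return gains


def residual_gap(final_success_percent):
    """Share of attempts still failing after the last intervention."""
    return round(100.0 - final_success_percent, 1)


# Gemini-3-Pro figures quoted in the Results summary.
gemini_stages = [("baseline", 26.0), ("hints + scientist sub-agent", 64.0)]

if __name__ == "__main__":
    print(marginal_gains(gemini_stages))
    print(residual_gap(gemini_stages[-1][1]))
```

With these two data points, the combined interventions account for 38.0 percentage points and the residual is 36.0%, which matches the lower end of the reported residual application gap.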
Key Points
- SciCrafter operationalizes the discovery-to-application loop through five families of scalable Minecraft redstone construction tasks whose difficulty crosses discrete mechanism thresholds (local wiring grammar, signal attenuation, repeater semantics), requiring genuine discovery rather than memorized solutions.
- Under a standardized Claude Code agent scaffold, frontier models (GPT-5.2, Gemini-3-Pro, Claude-Opus-4.5, Grok-4) achieve only 21.0%–26.0% baseline success across 25 tasks, suggesting that scaling model size alone does not resolve discovery-to-application bottlenecks.
- Targeted interventions reveal differentiated capacity gaps: oracle hints yield the largest single improvement (15.0–27.0 percentage points absolute), and the scientist sub-agent adds a further 7.5–14.0 points. The structured Claim-Proof-Constraints-Example consolidation format outperforms free-form summaries (64.0% vs. 58.0% on Gemini-3-Pro), yet a substantial residual application gap (36.0–57.0%) persists.
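One way to picture the Claim-Proof-Constraints-Example format is as a record with four named fields. The sketch below is an assumption based only on the format's name: the class, its field contents, and the rendering are hypothetical, and the sample entry is an illustrative redstone fact rather than output from the paper's agents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeEntry:
    """Hypothetical record with one field per part of the format's name."""

    claim: str        # the mechanic the agent believes it has discovered
    proof: str        # the experiment or observation supporting the claim
    constraints: str  # conditions under which the claim holds
    example: str      # a concrete use of the claim in a build

    def render(self) -> str:
        """Serialize the entry as four labeled lines, e.g. for an agent's notes."""
        return "\n".join(
            [
                f"Claim: {self.claim}",
                f"Proof: {self.proof}",
                f"Constraints: {self.constraints}",
                f"Example: {self.example}",
            ]
        )


# Illustrative entry only; not taken from the benchmark or its logs.
entry = KnowledgeEntry(
    claim="A redstone dust signal loses 1 strength per block travelled.",
    proof="A lamp at the 15th dust block lit; a lamp at the 16th did not.",
    constraints="Straight dust run from a strength-15 source, no repeaters.",
    example="Place a repeater before the 16th block to restore the signal.",
)

if __name__ == "__main__":
    print(entry.render())
```

Compared with a free-form summary, a fixed schema like this forces each recorded finding to carry its supporting evidence and its limits, which is one plausible reason a structured format could transfer better to later construction tasks.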