Related papers: CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

URL: http://arxiv.org/abs/2503.23145v1
Date: Sat, 29 Mar 2025 16:50:39 GMT
Title: CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis
Authors: Anjiang Wei, Tarun Suresh, Jiannan Cao, Naveen Kannan, Yuheng Wu, Kai Yan, Thiago S. F. X. Teixeira, Ke Wang, Alex Aiken,
Abstract summary: Large language model agents have shown promise in programming tasks guided by natural language.<n>Existing evaluation protocols rely on static sets of examples and held-out tests.<n>We propose CodeARC, a new evaluation framework where agents interact with a hidden target function.
Score: 6.8081984950459
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real-world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self-correction based on feedback. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3-mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine-tuning LLaMA-3.1-8B-Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM-based program synthesis and inductive reasoning.

Related papers

Case2Code: Scalable Synthetic Data for Code Generation [105.89741089673575]
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation.<n>Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs.<n>We propose a textbfCase2Code task by exploiting the expressiveness and correctness of programs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z)
Inductive Predicate Synthesis Modulo Programs (Extended) [1.7372615815088566]
A growing trend in program analysis is to encode verification conditions within the language of the input program. We propose Inductive Predicate Synthesis Modulo Programs (IPS-MP) which extends high-level languages with minimal synthesis features.
arXiv Detail & Related papers (2024-07-11T12:51:08Z)
HYSYNTH: Context-Free LLM Approximation for Guiding Program Synthesis [25.260063704712458]
Large language models (LLMs) often fail to produce fully correct programs in unfamiliar DSLs. Motivated by these limitations, we introduce a hybrid approach, where LLM completions for a given task are used to learn a task-specific, context-free surrogate model. We evaluate this hybrid approach on three domains, and show that it outperforms both unguided search and direct sampling from LLMs, as well as existing program synthesizers.
arXiv Detail & Related papers (2024-05-24T18:45:51Z)
Guiding Enumerative Program Synthesis with Large Language Models [15.500250058226474]
In this paper, we evaluate the abilities of Large Language Models to solve formal synthesis benchmarks. When one-shot synthesis fails, we propose a novel enumerative synthesis algorithm. We find that GPT-3.5 as a stand-alone tool for formal synthesis is easily outperformed by state-of-the-art formal synthesis algorithms.
arXiv Detail & Related papers (2024-03-06T19:13:53Z)
ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis [54.18659323181771]
We characterize several different forms of compositional generalization that are desirable in program synthesis. We propose ExeDec, a novel decomposition-based strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step.
arXiv Detail & Related papers (2023-07-26T01:07:52Z)
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control [23.115574119132507]
Building agents with large language models for computer control is a burgeoning research area, where the agent receives computer states and performs actions to complete tasks. Previous computer agents have demonstrated the benefits of in-context learning (ICL), but their performance is hindered by several issues. We introduce Synapse, a computer agent featuring three key components: i) state abstraction, which filters out task-irrelevant information from raw states, allowing more exemplars within the limited context, ii) trajectory-as-exemplar prompting, which prompts the LLM with complete trajectories of the abstracted states and actions to improve multi-step decision-
arXiv Detail & Related papers (2023-06-13T15:49:41Z)
Latent Execution for Neural Program Synthesis Beyond Domain-Specific Languages [97.58968222942173]
We take the first step to synthesize C programs from input-output examples. In particular, we propose La Synth, which learns the latent representation to approximate the execution of partially generated programs. We show that training on these synthesized programs further improves the prediction performance for both Karel and C program synthesis.
arXiv Detail & Related papers (2021-06-29T02:21:32Z)
Leveraging Language to Learn Program Abstractions and Search Heuristics [66.28391181268645]
We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization.
arXiv Detail & Related papers (2021-06-18T15:08:47Z)
BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration [72.88493072196094]
We present a new synthesis approach that leverages learning to guide a bottom-up search over programs. In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a set of input-output examples. We show that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches.
arXiv Detail & Related papers (2020-07-28T17:46:18Z)
Synthesize, Execute and Debug: Learning to Repair for Neural Program Synthesis [81.54148730967394]
We propose SED, a neural program generation framework that incorporates synthesis, execution, and debug stages. SED first produces initial programs using the neural program synthesizer component, then utilizes a neural program debugger to iteratively repair the generated programs. On Karel, a challenging input-output program synthesis benchmark, SED reduces the error rate of the neural program synthesizer itself by a considerable margin, and outperforms the standard beam search for decoding.
arXiv Detail & Related papers (2020-07-16T04:15:47Z)
Towards Neural-Guided Program Synthesis for Linear Temporal Logic Specifications [26.547133495699093]
We use a neural network to learn a Q-function that is then used to guide search, and to construct programs that are subsequently verified for correctness. Our method is unique in combining search with deep learning to realize synthesis.
arXiv Detail & Related papers (2019-12-31T17:09:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.