KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization
- URL: http://arxiv.org/abs/2601.21526v2
- Date: Sat, 31 Jan 2026 20:40:35 GMT
- Title: KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization
- Authors: Alireza Nadafian, Alireza Mohammadshahi, Majid Yazdani
- Abstract summary: KAPSO is a modular framework for autonomous program synthesis and optimization. It iteratively performs ideation, code synthesis and editing, execution, evaluation, and learning to improve a runnable artifact.
- Score: 3.0268242725574215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce KAPSO, a modular framework for autonomous program synthesis and optimization. Given a natural language goal and an evaluation method, KAPSO iteratively performs ideation, code synthesis and editing, execution, evaluation, and learning to improve a runnable artifact toward measurable objectives. Rather than treating synthesis as the endpoint, KAPSO uses synthesis as an operator within a long-horizon optimization loop, where progress is defined by evaluator outcomes. KAPSO targets long-horizon failures common in coding agents, including lost experimental state, brittle debugging, and weak reuse of domain expertise, by integrating three tightly coupled components. First, a git-native experimentation engine isolates each attempt as a branch, producing reproducible artifacts and preserving provenance across iterations. Second, a knowledge system ingests heterogeneous sources, including repositories, internal playbooks, and curated external resources such as documentation, scientific papers, and web search results, and organizes them into a structured representation that supports retrieval over workflows, implementations, and environment constraints. Third, a cognitive memory layer coordinates retrieval and maintains an episodic store of reusable lessons distilled from experiment traces (run logs, diffs, and evaluator feedback), reducing repeated error modes and accelerating convergence. We evaluated KAPSO on MLE-Bench (Kaggle-style ML competitions) and ALE-Bench (AtCoder heuristic optimization), and report end-to-end performance. Code Available at: https://github.com/Leeroo-AI/kapso
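The loop the abstract describes (ideation, synthesis, execution, evaluation, learning, with each attempt isolated on its own git branch) can be sketched as follows. This is a minimal illustration under stated assumptions, not KAPSO's actual API: the callables `propose_idea`, `synthesize_code`, `evaluate`, and `distill_lesson` are hypothetical stand-ins, and the real implementation is in the linked repository.

```python
import subprocess

def run(cmd, cwd="."):
    """Run a shell command and return its stdout (raises on failure)."""
    return subprocess.run(cmd, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

def optimize(goal, evaluate, propose_idea, synthesize_code, distill_lesson,
             repo=".", max_iters=20):
    """Hypothetical outer loop: each attempt lives on its own git branch,
    the evaluator defines progress, and lessons distilled from traces
    feed back into the next round of ideation."""
    best_score, lessons = float("-inf"), []
    for i in range(max_iters):
        run(["git", "checkout", "-b", f"attempt-{i}"], cwd=repo)  # isolate attempt
        idea = propose_idea(goal, lessons)                        # ideation
        synthesize_code(idea, repo)                               # synthesis / editing
        run(["git", "add", "-A"], cwd=repo)
        run(["git", "commit", "--allow-empty",
             "-m", f"attempt {i}: {idea[:50]}"], cwd=repo)        # preserve provenance
        score, logs = evaluate(repo)                              # execution + evaluation
        lessons.append(distill_lesson(idea, logs, score))         # episodic memory
        best_score = max(best_score, score)
        run(["git", "checkout", "main"], cwd=repo)                # branch kept for replay
    return best_score, lessons
```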
Related papers
- InfoSynth: Information-Guided Benchmark Synthesis for LLMs [69.80981631587501]
Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks.
arXiv Detail & Related papers (2026-01-02T05:26:27Z) - Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem [90.17610617854247]
We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic models. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME, an open-source agent grounded by ALE and trained on over one million trajectories.
arXiv Detail & Related papers (2025-12-31T14:03:39Z) - CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning [34.38636514331703]
CLaRa is a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. Experiments show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
arXiv Detail & Related papers (2025-11-24T00:11:14Z) - Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
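The hindsight-rewriting idea summarized above can be sketched in a few lines; the prompts and helper names here are illustrative assumptions, not ECHO's actual prompt format.

```python
def hindsight_rewrite(trajectory, llm):
    """Turn a failed attempt into a useful exemplar: ask which goal the
    trajectory *did* accomplish, then rewrite it as a success for that goal.
    `llm` is any callable mapping a prompt string to a completion string."""
    achieved_goal = llm(
        "Here is an agent trajectory that failed its original goal:\n"
        f"{trajectory}\n"
        "State a goal this trajectory actually achieved:"
    )
    optimized = llm(
        f"Rewrite this trajectory as a concise, direct solution to the goal "
        f"'{achieved_goal}':\n{trajectory}"
    )
    return achieved_goal, optimized

# Exemplars collected this way can be prepended to future prompts so the
# agent reuses successes it stumbled into during failed episodes.
```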
arXiv Detail & Related papers (2025-10-11T18:11:09Z) - Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers [103.4410890572479]
We introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification. LoongBench is a curated seed dataset containing 8,729 human-vetted examples across 12 domains. LoongEnv is a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples.
arXiv Detail & Related papers (2025-09-03T06:42:40Z) - DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis [52.636738269442766]
We introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis systems. DeepScholar-bench draws queries from recent, high-quality arXiv papers and focuses on a real research synthesis task. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API.
arXiv Detail & Related papers (2025-08-27T16:36:34Z) - HEAS: Hierarchical Evolutionary Agent Simulation Framework for Cross-Scale Modeling and Multi-Objective Search [4.807104001943257]
HEAS (Hierarchical Evolutionary Agent Simulation) is a Python framework that unifies layered agent-based modeling with evolutionary optimization and tournament evaluation. HEAS represents models as hierarchies of lightweight processes ("streams") scheduled in deterministic layers that read and write a shared context. A compact API and CLI (simulate, optimize, evaluate) expose single- and multi-objective evolution.
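The layered-streams scheduling described above can be illustrated with a short sketch; the names are invented for illustration and do not reflect the HEAS API.

```python
def run_layers(layers, context, steps=10):
    """Deterministic layered scheduling: each layer is an ordered list of
    stream callables that read and write one shared context dict."""
    for _ in range(steps):
        for layer in layers:          # layers run in a fixed order
            for stream in layer:      # streams within a layer run in order
                stream(context)
    return context

# Illustrative streams: a producer, a consumer, and a logger.
def produce(ctx): ctx["supply"] = ctx.get("supply", 0) + 1
def consume(ctx): ctx["supply"] = max(0, ctx.get("supply", 0) - 1)
def log(ctx): ctx.setdefault("history", []).append(ctx.get("supply", 0))

state = run_layers([[produce], [consume, log]], {})
print(state["history"])
```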
arXiv Detail & Related papers (2025-08-21T13:35:46Z) - SWE-Bench-CL: Continual Learning for Coding Agents [0.0]
SWE-Bench-CL is a novel continual learning benchmark built on the human-verified SWE-Bench Verified dataset. By organizing GitHub issues into chronologically ordered sequences that reflect natural repository evolution, SWE-Bench-CL enables direct evaluation of an agent's ability to accumulate experience.
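As a rough illustration of the chronological-sequence construction (the field names are assumed, not the benchmark's actual schema):

```python
from itertools import groupby

def build_sequences(issues):
    """Group resolved issues by repository and order each group by creation
    time, so an agent sees tasks in the order the repo actually evolved.
    `issues` is assumed to be a list of dicts with 'repo' and 'created_at' keys."""
    issues = sorted(issues, key=lambda x: (x["repo"], x["created_at"]))
    return {repo: list(group)
            for repo, group in groupby(issues, key=lambda x: x["repo"])}
```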
arXiv Detail & Related papers (2025-06-13T07:11:14Z) - What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond [32.467437657603604]
Repository-level code generation remains challenging due to complex code dependencies and the limitations of large language models (LLMs) in processing long contexts. We propose AllianceCoder, a novel context-integrated method that employs chain-of-thought prompting to decompose user queries into implementation steps and retrieves APIs via semantic description matching. Through extensive experiments on CoderEval and RepoExec, AllianceCoder achieves state-of-the-art performance, improving Pass@1 by up to 20% over existing approaches.
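The decompose-then-match retrieval described above can be sketched as follows; `embed` and `llm` are hypothetical stand-ins for an embedding model and a language model, and this is not AllianceCoder's actual implementation.

```python
import numpy as np

def retrieve_apis(query, api_descriptions, embed, llm, top_k=3):
    """Decompose the query into implementation steps with an LLM, then rank
    candidate APIs by cosine similarity between step embeddings and
    API-description embeddings. `embed` maps text -> 1-D vector."""
    steps = llm(f"List the implementation steps for: {query}").splitlines()
    api_vecs = np.array([embed(d) for d in api_descriptions], dtype=float)
    api_vecs /= np.linalg.norm(api_vecs, axis=1, keepdims=True)
    hits = []
    for step in filter(None, (s.strip() for s in steps)):
        v = np.asarray(embed(step), dtype=float)
        v /= np.linalg.norm(v)
        scores = api_vecs @ v                      # cosine similarity per API
        hits.extend(np.argsort(scores)[::-1][:top_k])
    # Deduplicate while preserving rank order across steps.
    return [api_descriptions[i] for i in dict.fromkeys(hits)]
```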
arXiv Detail & Related papers (2025-03-26T14:41:38Z) - SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z) - Comparative Code Structure Analysis using Deep Learning for Performance Prediction [18.226950022938954]
This paper aims to assess the feasibility of using purely static information (e.g., abstract syntax tree or AST) of applications to predict performance change based on the change in code structure.
Our evaluations of several deep embedding learning methods demonstrate that tree-based Long Short-Term Memory (LSTM) models can leverage the hierarchical structure of source code to discover latent representations and achieve up to 84% (individual problem) and 73% (combined dataset with multiple problems) accuracy in predicting the change in performance.
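A purely static AST-based representation of the kind such tree-structured models consume can be sketched with Python's `ast` module; this illustrates the input signal only, not the paper's tree-LSTM pipeline.

```python
import ast

def ast_paths(source):
    """Flatten a program's AST into root-to-node type paths: a purely static
    view of code structure (no execution needed) that a tree-structured
    model could consume."""
    tree = ast.parse(source)
    paths = []
    def walk(node, prefix):
        label = type(node).__name__
        paths.append(prefix + [label])
        for child in ast.iter_child_nodes(node):
            walk(child, prefix + [label])
    walk(tree, [])
    return ["/".join(p) for p in paths]

before = ast_paths("for i in range(n):\n    total += a[i]")
after  = ast_paths("total = sum(a[:n])")
# The structural difference between versions is the model's input signal.
print(set(before) ^ set(after))
```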
arXiv Detail & Related papers (2021-02-12T16:59:12Z)