CP-Agent: Agentic Constraint Programming
- URL: http://arxiv.org/abs/2508.07468v1
- Date: Sun, 10 Aug 2025 19:59:01 GMT
- Title: CP-Agent: Agentic Constraint Programming
- Authors: Stefan Szeider
- Abstract summary: Translating natural language problem descriptions into formal constraint models is a fundamental challenge in constraint programming. Previous approaches have employed fixed workflows with predetermined modeling steps, failing on a significant number of benchmark problems. We present a new approach using a pure agentic strategy without any fixed pipeline.
- Score: 23.191983095692223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Translating natural language problem descriptions into formal constraint models remains a fundamental challenge in constraint programming, requiring deep expertise in both the problem domain and modeling frameworks. Previous approaches to automating this translation have employed fixed workflows with predetermined modeling steps, failing on a significant number of benchmark problems. We present a new approach using a pure agentic strategy without any fixed pipeline. We developed a general-purpose Python coding agent based on the ReAct (Reason and Act) principle, utilizing a persistent IPython kernel for stateful code execution and iterative development. Rather than embedding constraint programming logic into the agent architecture, domain-specific expertise is injected solely through a carefully crafted project prompt. The agent combines this prompt-encoded knowledge with access to file operations and code execution tools, enabling it to test hypotheses, debug failures, and verify solutions dynamically. Implemented in just a few hundred lines of code, this architecture successfully solves all 101 problems of the CP-Bench constraint programming benchmark set. The results suggest that constraint modeling tasks require the combination of general coding tools and domain expertise encoded in prompts, rather than specialized agent architectures or predefined workflows.
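The abstract outlines the core mechanism: a ReAct loop that alternates between model reasoning and stateful code execution in a persistent IPython session, with constraint-programming expertise supplied only through the project prompt. The following is a minimal illustrative sketch of that pattern, not the paper's actual implementation; the `llm_complete` helper, the prompt text, and the `<execute>` tag protocol are assumptions introduced here.

```python
# Minimal sketch of a ReAct-style coding agent with a persistent IPython session.
# llm_complete, the prompt wording, and the action format are hypothetical.
import re
from IPython.core.interactiveshell import InteractiveShell

PROJECT_PROMPT = (
    "You are a constraint-programming modeller. Translate the problem into an "
    "executable CP model, run it, and verify the solution. Reply with Python code "
    "wrapped in <execute>...</execute> tags to run it, or with FINAL: <answer> "
    "once the solution is verified."
)

def llm_complete(messages: list[dict]) -> str:
    """Placeholder for a chat-completion call to an LLM (assumed, not specified by the paper)."""
    raise NotImplementedError

def react_solve(problem: str, max_steps: int = 20) -> str:
    shell = InteractiveShell.instance()          # persistent, stateful Python session
    messages = [{"role": "system", "content": PROJECT_PROMPT},
                {"role": "user", "content": problem}]
    for _ in range(max_steps):
        reply = llm_complete(messages)           # Reason: the model decides the next action
        messages.append({"role": "assistant", "content": reply})
        if "FINAL:" in reply:                    # the agent declares a verified answer
            return reply.split("FINAL:", 1)[1].strip()
        match = re.search(r"<execute>(.*?)</execute>", reply, re.S)
        if match is None:
            messages.append({"role": "user",
                             "content": "No code found; emit an <execute> block or answer with FINAL:."})
            continue
        result = shell.run_cell(match.group(1))  # Act: run code in the shared namespace
        observation = repr(result.result) if result.success else repr(result.error_in_exec)
        messages.append({"role": "user",
                         "content": f"Observation:\n{observation}"})  # Observe: feed output back
    return "no solution within the step budget"
```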
Related papers
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z) - NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents [79.29376673236142]
Existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. We present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents.
arXiv Detail & Related papers (2025-12-14T15:12:13Z) - IACT: A Self-Organizing Recursive Model for General AI Agents: A Technical White Paper on the Architecture Behind kragent.ai [0.0]
Interactive Agents Call Tree (IACT) is a general-purpose autonomous system driven purely by user dialogue. We describe the architecture, design principles, and practical lessons behind the deployment of this model in the kragent.ai system.
arXiv Detail & Related papers (2025-12-02T10:10:56Z) - SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [59.90381306452982]
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. We introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests.
arXiv Detail & Related papers (2025-11-07T18:01:32Z) - Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning [65.20602712957725]
Caco is a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
arXiv Detail & Related papers (2025-10-05T07:59:24Z) - Re4: Scientific Computing Agent with Rewriting, Resolution, Review and Revision [4.55391222496256]
Large language models (LLMs) are an active and promising area of generative artificial intelligence. In this work, we construct a novel agent framework for solving representative problems in scientific computing. The proposed agent, incorporating a "rewriting-resolution-review-revision" logical chain, is integrated in a collaborative and interactive manner.
arXiv Detail & Related papers (2025-08-28T12:50:48Z) - Blueprint First, Model Second: A Framework for Deterministic LLM Workflow [3.9886771197662925]
We introduce the Source Code Agent framework, a new paradigm built on the "Blueprint First, Model Second" philosophy. Our framework decouples the workflow logic from the generative model. Our work enables the verifiable and reliable deployment of autonomous agents in applications governed by strict procedural logic.
arXiv Detail & Related papers (2025-08-01T03:10:00Z) - AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development Automation [0.0]
We propose a Python-based framework that uses multiple cooperating LLM-powered agents to automate software development tasks. In AgentMesh, specialized agents - a Planner, Coder, Debugger, and Reviewer - work in concert to transform a high-level requirement into fully realized code.
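As a rough illustration of the Planner/Coder/Debugger/Reviewer hand-off described above, the sketch below shows one plausible way such a pipeline could be wired; the role prompts, the `llm` helper, and the single-file hand-off format are illustrative assumptions, not AgentMesh's actual API.

```python
# Hypothetical sketch of a Planner -> Coder -> Debugger -> Reviewer pipeline.
import pathlib
import subprocess

def llm(role_prompt: str, content: str) -> str:
    """Placeholder for a role-conditioned LLM call (assumed, not AgentMesh's API)."""
    raise NotImplementedError

def build(requirement: str, max_debug_rounds: int = 3) -> str:
    # Planner: break the high-level requirement into concrete coding tasks.
    plan = llm("You are a Planner. Decompose the requirement into coding tasks.", requirement)
    # Coder: turn the plan into a first implementation.
    code = llm("You are a Coder. Implement the plan as a single runnable Python file.", plan)
    for _ in range(max_debug_rounds):
        path = pathlib.Path("candidate.py")
        path.write_text(code)
        run = subprocess.run(["python", str(path)], capture_output=True, text=True, timeout=60)
        if run.returncode == 0:                  # candidate runs cleanly; stop debugging
            break
        # Debugger: repair the code using the captured traceback.
        code = llm("You are a Debugger. Fix the code given this traceback.",
                   f"{code}\n\nSTDERR:\n{run.stderr}")
    # Reviewer: final quality gate before the code is returned.
    review = llm("You are a Reviewer. Approve the code or request changes.", code)
    if "approve" not in review.lower():
        code = llm("You are a Coder. Revise the code to address this review.", f"{code}\n\n{review}")
    return code
```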
arXiv Detail & Related papers (2025-07-26T10:10:02Z) - MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating the code generation abilities of the latest LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z) - Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents [40.37993572657772]
We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions. We demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
arXiv Detail & Related papers (2025-05-30T19:23:51Z) - AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. AgentIF is the first benchmark for systematically evaluating LLM instruction-following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z) - OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problems with Reasoning LLM [15.260794368585692]
We propose OR-LLM-Agent, an AI agent framework built on reasoning LLMs for automated Operations Research problem solving. We show that OR-LLM-Agent utilizing DeepSeek-R1 in its framework outperforms advanced methods, including GPT-o3, Gemini 2.5 Pro, DeepSeek-R1, and ORLM, by at least 7% in accuracy.
arXiv Detail & Related papers (2025-03-13T03:40:50Z) - SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
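To make the "directed graph of executable functions" idea concrete, here is a small hypothetical sketch in which an agent may only call a step once its prerequisite steps have completed; the class, function names, and refund example are illustrative assumptions rather than SOPBench's actual schema.

```python
# Hedged sketch: SOP steps as a directed graph of callable functions with prerequisite edges.
from typing import Callable

class SOPGraph:
    """Directed graph of executable SOP steps; edges encode ordering constraints."""

    def __init__(self) -> None:
        self.functions: dict[str, Callable[..., object]] = {}
        self.requires: dict[str, set[str]] = {}   # step -> prerequisite steps
        self.completed: set[str] = set()

    def register(self, name: str, func: Callable[..., object], after: set[str] | None = None) -> None:
        self.functions[name] = func
        self.requires[name] = set(after or ())

    def call(self, name: str, *args, **kwargs):
        missing = self.requires[name] - self.completed
        if missing:                                # the agent violated the SOP ordering constraint
            raise RuntimeError(f"{name} called before prerequisite step(s) {sorted(missing)}")
        out = self.functions[name](*args, **kwargs)
        self.completed.add(name)
        return out

# Example: a refund SOP in which identity verification must precede issuing the refund.
sop = SOPGraph()
sop.register("verify_identity", lambda user: True)
sop.register("issue_refund", lambda order: f"refunded {order}", after={"verify_identity"})
sop.call("verify_identity", "alice")
print(sop.call("issue_refund", "order-42"))      # allowed only after verify_identity
```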
arXiv Detail & Related papers (2025-03-11T17:53:02Z) - ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models [81.12673534903979]
Tool learning has emerged as a crucial capability for large language models (LLMs) to solve complex real-world tasks through interaction with external tools. We propose ToolCoder, a novel framework that reformulates tool learning as a code generation task.
arXiv Detail & Related papers (2025-02-17T03:42:28Z) - AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions [10.686849324750556]
The Automated gEneral buG reproductIon Scripts generation framework, named AEGIS, is the first agent-based framework for the task. AEGIS can improve the relative resolved rate of Agentless by 12.5%.
arXiv Detail & Related papers (2024-11-27T03:16:47Z)