CP-Bench: Evaluating Large Language Models for Constraint Modelling
- URL: http://arxiv.org/abs/2506.06052v1
- Date: Fri, 06 Jun 2025 12:56:02 GMT
- Title: CP-Bench: Evaluating Large Language Models for Constraint Modelling
- Authors: Kostis Michailidis, Dimos Tsouros, Tias Guns
- Abstract summary: Constraint Programming (CP) is a well-suited problem-solving paradigm, but its core process, namely constraint modelling, is a bottleneck for wider adoption. Recent studies have explored using Large Language Models (LLMs) as modelling assistants, transforming problem descriptions to executable constraint models. This work addresses the gap by introducing CP-Bench, a novel benchmark dataset that includes a diverse set of well-known problem classes.
- Score: 6.273426548149088
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Combinatorial problems are present in a wide range of industries. Constraint Programming (CP) is a well-suited problem-solving paradigm, but its core process, namely constraint modelling, is a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) as modelling assistants, transforming combinatorial problem descriptions to executable constraint models, similar to coding assistants. However, the existing evaluation datasets for constraint modelling are often limited to small, homogeneous, or domain-specific instances, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing CP-Bench, a novel benchmark dataset that includes a diverse set of well-known combinatorial problem classes sourced from the CP community, structured explicitly for evaluating LLM-driven CP modelling. With this dataset, and given the variety of constraint modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax: the high-level MiniZinc language and Python-based CPMpy library, and the lower-level Python interface of the OR-Tools CP-SAT solver. In order to enhance the ability of LLMs to produce valid constraint models, we systematically evaluate the use of prompt-based and inference-time compute methods adapted from existing LLM-based code generation research. Our results underscore the modelling convenience provided by Python-based frameworks, as well as the effectiveness of documentation-rich system prompts, which, augmented with repeated sampling and self-verification, achieve further improvements, reaching up to 70% accuracy on this new, highly challenging benchmark.
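To make the framework comparison concrete, below is a minimal sketch, written for this summary rather than taken from the paper, of one toy constraint model (four mutually distinct digits in 1..9 that sum to 20) in two of the three evaluated systems; the problem and all names are our own illustration.

```python
# Minimal sketch (ours, not from the paper): the same toy model in two of
# the three evaluated systems, illustrating the difference in abstraction level.

# --- CPMpy: high-level, numpy-like Python modelling library ---
import cpmpy as cp

x = cp.intvar(1, 9, shape=4, name="x")            # four integer decision variables
model = cp.Model(cp.AllDifferent(x), cp.sum(x) == 20)
if model.solve():
    print("CPMpy solution:", x.value())

# --- OR-Tools CP-SAT: lower-level, solver-specific Python interface ---
from ortools.sat.python import cp_model

m = cp_model.CpModel()
y = [m.NewIntVar(1, 9, f"y{i}") for i in range(4)]
m.AddAllDifferent(y)
m.Add(sum(y) == 20)
solver = cp_model.CpSolver()
if solver.Solve(m) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("CP-SAT solution:", [solver.Value(v) for v in y])
```

Even on a toy problem, the difference in verbosity hints at why Python-based, higher-level frameworks may be more convenient targets for LLM-generated models; the MiniZinc variant (not shown) would be written in a standalone modelling language rather than embedded in Python.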
Related papers
- Accurate and Consistent Graph Model Generation from Text with Large Language Models [1.9049294570026933]
Graph model generation from natural language description is an important task with many applications in software engineering. With the rise of large language models (LLMs), there is a growing interest in using LLMs for graph model generation. We propose a novel abstraction-concretization framework that enhances the consistency and quality of generated graph models.
arXiv Detail & Related papers (2025-08-01T01:52:25Z)
- Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models [50.19188692497892]
Traditional alignment methods often require retraining large pretrained models. We propose a novel Residual Alignment Model (RAM) that formalizes the alignment process as a type of importance sampling. We develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods.
arXiv Detail & Related papers (2025-05-26T08:53:02Z)
- Relative Overfitting and Accept-Reject Framework [5.465098504510676]
We propose an ensemble framework that governs how models are segmented to ensure performance improvement. We detail the patterns of this framework within the domain of NLP and briefly describe its applicability to other fields, such as computer vision (CV) and AI for science.
arXiv Detail & Related papers (2025-05-12T17:36:14Z)
- Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo [90.78001821963008]
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. We develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language.
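As a rough illustration of the SMC idea described above (a toy of our own construction, not the system from this paper), the sketch below maintains a population of partial sequences, weights each extension by a hard syntactic constraint, and resamples so compute concentrates on valid continuations.

```python
# Toy SMC for constrained generation (our illustration, not the paper's system).
import random

VOCAB = list("ab()")

def lm_step_probs(prefix: str) -> dict[str, float]:
    # Stand-in for a real language model: a uniform next-token distribution.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def constraint_ok(prefix: str) -> bool:
    # Toy syntactic constraint: parentheses never close more than they open.
    depth = 0
    for ch in prefix:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return True

def smc_generate(n_particles: int = 100, length: int = 8) -> list[str]:
    particles = [""] * n_particles
    for _ in range(length):
        proposals, weights = [], []
        for p in particles:
            probs = lm_step_probs(p)
            tok = random.choices(list(probs), weights=list(probs.values()))[0]
            ext = p + tok
            proposals.append(ext)
            # Hard constraint as a 0/1 potential; a real system would use
            # soft potentials and incremental checkers.
            weights.append(1.0 if constraint_ok(ext) else 0.0)
        if sum(weights) == 0:
            return particles  # every extension violates the constraint
        # Resampling step: keep particles in proportion to their weights.
        particles = random.choices(proposals, weights=weights, k=n_particles)
    return particles

print(smc_generate(n_particles=20, length=6)[:5])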
arXiv Detail & Related papers (2025-04-17T17:49:40Z)
- A Statistical Framework for Ranking LLM-Based Chatbots [57.59268154690763]
We propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle groupings of human-judged comparisons. Second, we extend the framework to model covariance between competitors, enabling deeper insights into performance relationships. Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints.
arXiv Detail & Related papers (2024-12-24T12:54:19Z)
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control.
We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
- Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild [84.57103623507082]
This paper introduces Model-GLUE, a holistic Large Language Models scaling guideline. We benchmark existing scaling techniques, especially selective merging, and variants of mixture. We then formulate an optimal strategy for the selection and aggregation of a heterogeneous model zoo. Our methodology involves the clustering of mergeable models, optimal merging strategy selection, and the integration of clusters.
arXiv Detail & Related papers (2024-10-07T15:55:55Z)
- Automatic Feature Learning for Essence: a Case Study on Car Sequencing [1.006631010704608]
We consider the task of building machine learning models to automatically select the best combination of model and solver for a problem instance.
A critical part of the learning process is to define instance features, which serve as input to the selection model.
Our contribution is automatic learning of instance features directly from the high-level representation of a problem instance using a language model.
arXiv Detail & Related papers (2024-09-23T16:06:44Z)
- Learning to Learn in Interactive Constraint Acquisition [7.741303298648302]
In Constraint Acquisition (CA), the goal is to assist the user by automatically learning the model.
In (inter)active CA, this is done by interactively posting queries to the user.
We propose to use probabilistic classification models to guide interactive CA to generate more promising queries.
arXiv Detail & Related papers (2023-12-17T19:12:33Z)
- Data Summarization via Bilevel Optimization [48.89977988203108]
A simple yet powerful approach is to operate on small subsets of data.
In this work, we propose a generic coreset framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem.
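For readers unfamiliar with the setup, a cardinality-constrained bilevel coreset problem can be stated generically as follows; the notation below is ours, not the paper's.

```latex
% Generic cardinality-constrained bilevel coreset selection (our notation):
% pick a subset S of at most k points such that the model trained on S
% minimizes the loss over the full dataset D.
\[
  \min_{S \subseteq D,\ |S| \le k}
    \sum_{(x,y) \in D} \ell\bigl(f_{\theta^*(S)}(x),\, y\bigr)
  \quad \text{s.t.} \quad
  \theta^*(S) \in \arg\min_{\theta} \sum_{(x,y) \in S} \ell\bigl(f_{\theta}(x),\, y\bigr)
\]
```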
arXiv Detail & Related papers (2021-09-26T09:08:38Z)
- Towards Portfolios of Streamlined Constraint Models: A Case Study with the Balanced Academic Curriculum Problem [1.8466814193413488]
We focus on the automatic addition of streamliner constraints, derived from the types present in an abstract Essence specification of a problem class of interest.
The refinement of streamlined Essence specifications into constraint models gives rise to a large number of modelling choices.
Various forms of racing are utilised to constrain the computational cost of training.
arXiv Detail & Related papers (2020-09-21T19:48:02Z)
- PAC Bounds for Imitation and Model-based Batch Learning of Contextual Markov Decision Processes [31.83144400718369]
We consider the problem of batch multi-task reinforcement learning with observed context descriptors, motivated by its application to personalized medical treatment.
We study two general classes of learning algorithms: direct policy learning (DPL), an imitation-learning based approach which learns from expert trajectories, and model-based learning.
arXiv Detail & Related papers (2020-06-11T11:57:08Z)