Related papers: CP-Bench: Evaluating Large Language Models for Constraint Modelling

URL: http://arxiv.org/abs/2506.06052v2
Date: Thu, 04 Sep 2025 09:10:05 GMT
Title: CP-Bench: Evaluating Large Language Models for Constraint Modelling
Authors: Kostis Michailidis, Dimos Tsouros, Tias Guns
Abstract summary: Constraint programming (CP) is widely used to solve problems, but its core process, namely constraint modelling, requires significant expertise and is considered to be a bottleneck for wider adoption. Recent studies have explored using Large Language Models (LLMs) to transform problem descriptions into executable constraint models. Existing evaluation datasets for constraint modelling are often limited to small, homogeneous, or domain-specific instances, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing CP-Bench, a novel benchmark that includes a diverse set of well-known problems sourced from the CP community, structured
Score: 6.250460397062786
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Constraint Programming (CP) is widely used to solve combinatorial problems, but its core process, namely constraint modelling, requires significant expertise and is considered to be a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, the existing evaluation datasets for constraint modelling are often limited to small, homogeneous, or domain-specific instances, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing CP-Bench, a novel benchmark that includes a diverse set of well-known combinatorial problems sourced from the CP community, structured explicitly for evaluating LLM-driven CP modelling. With this dataset, and given the variety of constraint modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate the use of prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 70% on this highly challenging benchmark.

Related papers

LOCUS: Low-Dimensional Model Embeddings for Efficient Model Exploration, Comparison, and Selection [15.182368486530128]
We propose LOCUS, a method that produces low-dimensional vector embeddings that compactly represent a language model's capabilities across queries. LOCUS is an attention-based approach that generates embeddings by a deterministic forward pass over query encodings and evaluation scores via an encoder model. We train a correctness predictor that uses model embeddings and query encodings to achieve state-of-the-art routing accuracy on unseen queries.
arXiv Detail & Related papers (2026-01-28T22:09:42Z)
The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models [54.51795784459866]
We propose a theoretical framework of performance scaling for multi-model collaboration. We show that multi-model systems follow a power-law scaling with respect to the total parameter count. Ensembles of heterogeneous model families achieve better performance scaling than those formed within a single model family.
arXiv Detail & Related papers (2025-12-29T09:55:12Z)
When Words Change the Model: Sensitivity of LLMs for Constraint Programming Modelling [1.052782170493037]
Large language models show impressive results in automatically generating models for classical benchmarks. Many standard CP problems are likely included in the training data of these models. We show that while LLMs can produce syntactically valid and semantically plausible models, their performance drops sharply under contextual and linguistic variation.
arXiv Detail & Related papers (2025-11-18T10:40:32Z)
An Integrated Fusion Framework for Ensemble Learning Leveraging Gradient Boosting and Fuzzy Rule-Based Models [59.13182819190547]
Fuzzy rule-based models excel in interpretability and have seen widespread application across diverse fields. They face challenges such as complex design specifications and scalability issues with large datasets. This paper proposes an Integrated Fusion Framework that merges the strengths of both paradigms to enhance model performance and interpretability.
arXiv Detail & Related papers (2025-11-11T10:28:23Z)
Black-box Model Merging for Language-Model-as-a-Service with Massive Model Repositories [21.899117703417517]
We propose a derivative-free optimization framework based on the evolutionary algorithm (Evo-Merging). Our method consists of two key components: (1) sparsity-based denoising, designed to identify and filter out irrelevant or redundant information across models, and (2) sign-aware scaling, which dynamically computes optimal combination weights for the relevant models based on their performance. Our approach achieves state-of-the-art results on a range of tasks, significantly outperforming existing strong baselines.
arXiv Detail & Related papers (2025-09-16T10:55:50Z)
Accurate and Consistent Graph Model Generation from Text with Large Language Models [1.9049294570026933]
Graph model generation from natural language description is an important task with many applications in software engineering. With the rise of large language models (LLMs), there is a growing interest in using LLMs for graph model generation. We propose a novel abstraction-concretization framework that enhances the consistency and quality of generated graph models.
arXiv Detail & Related papers (2025-08-01T01:52:25Z)
Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models [50.19188692497892]
Traditional alignment methods often require retraining large pretrained models. We propose a novel Residual Alignment Model (RAM) that formalizes the alignment process as a type of importance sampling. We develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods.
arXiv Detail & Related papers (2025-05-26T08:53:02Z)
Relative Overfitting and Accept-Reject Framework [5.465098504510676]
We propose an ensemble framework that governs how models are segmented to ensure performance improvement. We detail the patterns of this framework within the domain of NLP and briefly describe its to other fields, such as computer vision (CV) and AI for science.
arXiv Detail & Related papers (2025-05-12T17:36:14Z)
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo [90.78001821963008]
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. We develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language.
arXiv Detail & Related papers (2025-04-17T17:49:40Z)
A Statistical Framework for Ranking LLM-Based Chatbots [57.59268154690763]
We propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle groupings of human-judged comparisons. Second, we extend the framework to model covariance tiers between competitors, enabling deeper insights into performance relationships. Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints.
arXiv Detail & Related papers (2024-12-24T12:54:19Z)
Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control. We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild [84.57103623507082]
This paper introduces Model-GLUE, a holistic Large Language Models scaling guideline. We benchmark existing scaling techniques, especially selective merging, and variants of mixture. We then formulate an optimal strategy for the selection and aggregation of a heterogeneous model zoo. Our methodology involves the clustering of mergeable models and optimal merging strategy selection, and the integration of clusters.
arXiv Detail & Related papers (2024-10-07T15:55:55Z)
Automatic Feature Learning for Essence: a Case Study on Car Sequencing [1.006631010704608]
We consider the task of building machine learning models to automatically select the best combination for a problem instance. A critical part of the learning process is to define instance features, which serve as input to the selection model. Our contribution is automatic learning of instance features directly from the high-level representation of a problem instance using a language model.
arXiv Detail & Related papers (2024-09-23T16:06:44Z)
Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling. Our research explores task-specific model pruning to inform decisions about designing SMoE architectures. We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
Sample Complexity Characterization for Linear Contextual MDPs [67.79455646673762]
Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time with different MDPs indexed by a context variable. CMDPs serve as an important framework to model many real-world applications with time-varying environments. We study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights.
arXiv Detail & Related papers (2024-02-05T03:25:04Z)
Learning to Learn in Interactive Constraint Acquisition [7.741303298648302]
In Constraint Acquisition (CA), the goal is to assist the user by automatically learning the model. In (inter)active CA, this is done by interactively posting queries to the user. We propose to use probabilistic classification models to guide interactive CA to generate more promising queries.
arXiv Detail & Related papers (2023-12-17T19:12:33Z)
Data Summarization via Bilevel Optimization [48.89977988203108]
A simple yet powerful approach is to operate on small subsets of data. In this work, we propose a generic coreset framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem.
arXiv Detail & Related papers (2021-09-26T09:08:38Z)
Towards Portfolios of Streamlined Constraint Models: A Case Study with the Balanced Academic Curriculum Problem [1.8466814193413488]
We focus on the automatic addition of streamliner constraints, derived from the types present in an abstract Essence specification of a problem class of interest. The refinement of streamlined Essence specifications into constraint models gives rise to a large number of modelling choices. Various forms of racing are utilised to constrain the computational cost of training.
arXiv Detail & Related papers (2020-09-21T19:48:02Z)
Control as Hybrid Inference [62.997667081978825]
We present an implementation of CHI which naturally mediates the balance between iterative and amortised inference. We verify the scalability of our algorithm on a continuous control benchmark, demonstrating that it outperforms strong model-free and model-based baselines.
arXiv Detail & Related papers (2020-07-11T19:44:09Z)
PAC Bounds for Imitation and Model-based Batch Learning of Contextual Markov Decision Processes [31.83144400718369]
We consider the problem of batch multi-task reinforcement learning with observed context descriptors, motivated by its application to personalized medical treatment. We study two general classes of learning algorithms: direct policy learning (DPL), an imitation-learning based approach which learns from expert trajectories, and model-based learning.
arXiv Detail & Related papers (2020-06-11T11:57:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.