When Words Change the Model: Sensitivity of LLMs for Constraint Programming Modelling
- URL: http://arxiv.org/abs/2511.14334v2
- Date: Wed, 19 Nov 2025 10:26:11 GMT
- Title: When Words Change the Model: Sensitivity of LLMs for Constraint Programming Modelling
- Authors: Alessio Pellegrino, Jacopo Mauro
- Abstract summary: Large language models show impressive results in automatically generating models for classical benchmarks. Many standard CP problems are likely included in the training data of these models. We show that while LLMs can produce syntactically valid and semantically plausible models, their performance drops sharply under contextual and linguistic variation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the long-standing goals in optimisation and constraint programming is to describe a problem in natural language and automatically obtain an executable, efficient model. Large language models appear to bring this vision closer, showing impressive results in automatically generating models for classical benchmarks. However, much of this apparent success may derive from data contamination rather than genuine reasoning: many standard CP problems are likely included in the training data of these models. To examine this hypothesis, we systematically rephrased and perturbed a set of well-known CSPLib problems to preserve their structure while modifying their context and introducing misleading elements. We then compared the models produced by three representative LLMs across original and modified descriptions. Our qualitative analysis shows that while LLMs can produce syntactically valid and semantically plausible models, their performance drops sharply under contextual and linguistic variation, revealing shallow understanding and sensitivity to wording.
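The paper provides no code here, but the pipeline it probes, turning a natural-language description into an executable constraint model, can be illustrated with a minimal sketch. The CSPLib-style N-Queens problem below is modelled in plain Python as a hypothetical stand-in for a real CP modelling language such as MiniZinc: permutations enforce the all-different row/column constraint, and only the diagonal constraints are checked explicitly.

```python
from itertools import permutations

def n_queens_solutions(n: int):
    """Enumerate solutions to the classic CSPLib-style N-Queens problem.

    A solution is a tuple cols where cols[r] is the column of the queen
    in row r. Using permutations already enforces the all-different
    constraint (one queen per row and per column), so only the two
    diagonal constraints remain to be checked.
    """
    for cols in permutations(range(n)):
        if all(abs(cols[i] - cols[j]) != j - i
               for i in range(n) for j in range(i + 1, n)):
            yield cols

print(len(list(n_queens_solutions(6))))  # 4 solutions for n = 6
```

A declarative CP model would instead state these constraints symbolically and delegate the search to a solver; the point of the sketch is only to show what an "executable model" of such a benchmark problem looks like.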
Related papers
- Large Language Model-Based Automatic Formulation for Stochastic Optimization Models [0.0]
This paper presents the first integrated systematic study on the performance of large language models (LLMs). We design several prompts that guide ChatGPT through structured tasks using chain-of-thought and modular reasoning. Across a diverse set of problems, GPT-4-Turbo outperforms other models in partial score, variable matching, and objective accuracy, with cot_s_instructions emerging as the most effective prompting strategy.
arXiv Detail & Related papers (2025-08-24T03:31:25Z)
- Accurate and Consistent Graph Model Generation from Text with Large Language Models [1.9049294570026933]
Graph model generation from natural language description is an important task with many applications in software engineering. With the rise of large language models (LLMs), there is a growing interest in using LLMs for graph model generation. We propose a novel abstraction-concretization framework that enhances the consistency and quality of generated graph models.
arXiv Detail & Related papers (2025-08-01T01:52:25Z)
- Language Models can perform Single-Utterance Self-Correction of Perturbed Reasoning [4.768151813962547]
Large Language Models (LLMs) have demonstrated impressive mathematical reasoning capabilities. Their performance remains brittle to minor variations in problem description and prompting strategy. To better understand the self-correction capabilities of recent models, we conduct experiments measuring models' ability to self-correct synthetically perturbed reasoning.
arXiv Detail & Related papers (2025-06-18T21:35:44Z)
- CP-Bench: Evaluating Large Language Models for Constraint Modelling [6.250460397062786]
Constraint programming (CP) is widely used to solve problems, but its core process, namely constraint modelling, requires significant expertise and is considered a bottleneck for wider adoption. Recent studies have explored using Large Language Models (LLMs) to transform problem descriptions into executable constraint models. Existing evaluation datasets for constraint modelling are often limited to small, homogeneous, or domain-specific instances, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing CP-Bench, a novel benchmark that includes a diverse set of well-known problems sourced from the CP community, structured
arXiv Detail & Related papers (2025-06-06T12:56:02Z)
- SALAD: Improving Robustness and Generalization through Contrastive Learning with Structure-Aware and LLM-Driven Augmented Data [15.366930934639838]
We propose SALAD, a novel approach to enhance model robustness and generalization. Our method generates structure-aware and counterfactually augmented data for contrastive learning. We validate our approach through experiments on three tasks: Sentiment Classification, Sexism Detection, and Natural Language Inference.
arXiv Detail & Related papers (2025-04-16T15:40:10Z)
- Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
- LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation [1.2576388595811496]
We introduce LINGOLY-TOO, a challenging reasoning benchmark grounded in natural language. We permute reasoning problems written in real languages to generate numerous question variations. Experiments and analyses show that models can circumvent reasoning and answer from prior knowledge.
arXiv Detail & Related papers (2025-03-04T19:57:47Z)
- How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
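The summary above does not specify the exact characterization procedure, but a difficulty split derived from training dynamics can be sketched as follows, assuming per-epoch gold-label probabilities for each example are available. The three-way split and the threshold values are illustrative assumptions, not the paper's.

```python
import statistics

def categorize_by_dynamics(epoch_probs):
    """Assign each example one of three difficulty levels from its training dynamics.

    epoch_probs: dict mapping example id -> list of the model's probability
    for the gold label at each training epoch. An example the model is
    consistently confident about is "easy"; one it rarely gets right is
    "hard"; everything in between is "ambiguous". Thresholds are illustrative.
    """
    levels = {}
    for ex_id, probs in epoch_probs.items():
        confidence = statistics.mean(probs)
        if confidence > 0.75:
            levels[ex_id] = "easy"
        elif confidence < 0.35:
            levels[ex_id] = "hard"
        else:
            levels[ex_id] = "ambiguous"
    return levels

dynamics = {"a": [0.90, 0.95, 0.97], "b": [0.20, 0.30, 0.25], "c": [0.40, 0.60, 0.50]}
print(categorize_by_dynamics(dynamics))  # {'a': 'easy', 'b': 'hard', 'c': 'ambiguous'}
```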
arXiv Detail & Related papers (2024-10-04T13:39:21Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we introduce how to fine-tune an LLM that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z)
- Rethinking Masked Language Modeling for Chinese Spelling Correction [70.85829000570203]
We study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model.
We find that fine-tuning BERT tends to overfit the error model while underfitting the language model, resulting in poor generalization to out-of-distribution error patterns.
We demonstrate that a very simple strategy, randomly masking 20% of the non-error tokens in the input sequence during fine-tuning, is sufficient for learning a much better language model without sacrificing the error model.
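As an illustration only (not the authors' implementation), the masking strategy described above can be sketched as:

```python
import random

def mask_non_error_tokens(tokens, error_positions, mask_token="[MASK]",
                          rate=0.2, seed=0):
    """Randomly mask a fraction of the *non-error* tokens in a sequence.

    tokens: the input token list; error_positions: indices of the known
    spelling errors, which are left untouched so the error model still
    sees them. Only non-error positions are candidates for masking.
    """
    rng = random.Random(seed)
    errors = set(error_positions)
    candidates = [i for i in range(len(tokens)) if i not in errors]
    k = int(len(candidates) * rate)
    masked = set(rng.sample(candidates, k))
    return [mask_token if i in masked else tok
            for i, tok in enumerate(tokens)]
```

Masking non-error tokens forces the model to predict correct characters from context, training the language model, while leaving the errors visible keeps the error model intact.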
arXiv Detail & Related papers (2023-05-28T13:19:12Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained models succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.