A framework for assessing the capabilities of code generation of constraint domain-specific languages with large language models
- URL: http://arxiv.org/abs/2603.05278v1
- Date: Thu, 05 Mar 2026 15:23:02 GMT
- Title: A framework for assessing the capabilities of code generation of constraint domain-specific languages with large language models
- Authors: David Delgado, Lola Burgueño, Robert Clarisó
- Abstract summary: Large language models (LLMs) can be used to support software development tasks, e.g., through code completion or code generation. We propose a generic framework for evaluating the capabilities of LLMs generating DSL code from textual specifications. This framework is applied to a particular type of DSL, constraint languages, focusing our experiments on OCL and Alloy.
- Score: 1.2234742322758418
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) can be used to support software development tasks, e.g., through code completion or code generation. However, their effectiveness drops significantly when considering less popular programming languages such as domain-specific languages (DSLs). In this paper, we propose a generic framework for evaluating the capabilities of LLMs generating DSL code from textual specifications. The generated code is assessed from the perspectives of well-formedness and correctness. This framework is applied to a particular type of DSL, constraint languages, focusing our experiments on OCL and Alloy and comparing their results to those achieved for Python, a popular general-purpose programming language. Experimental results show that, in general, LLMs have better performance for Python than for OCL and Alloy. LLMs with smaller context windows such as open-source LLMs may be unable to generate constraint-related code, as this requires managing both the constraint and the domain model where it is defined. Moreover, some improvements to the code generation process such as code repair (asking an LLM to fix incorrect code) or multiple attempts (generating several candidates for each coding task) can improve the quality of the generated code. Meanwhile, other decisions like the choice of a prompt template have less impact. All these dimensions can be systematically analyzed using our evaluation framework, making it possible to decide the most effective way to set up code generation for a particular type of task.
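The abstract's two improvement strategies, multiple attempts and code repair, can be sketched as a simple evaluation loop. This is a hypothetical illustration, not the paper's implementation: `generate`, `check_syntax`, and `check_correct` stand in for an LLM call and for language-specific checkers (e.g., an OCL parser or the Alloy analyzer).

```python
def evaluate(task, generate, check_syntax, check_correct,
             attempts=3, repairs=1):
    """Return the first candidate that is well-formed and correct, else None.

    Hypothetical sketch of the paper's setup: up to `attempts` fresh
    generations per task, each followed by up to `repairs` repair rounds
    where the model is asked to fix its own rejected output.
    """
    for _ in range(attempts):
        code = generate(task)
        for _ in range(repairs + 1):
            # Well-formedness first (syntax), then correctness (semantics).
            if check_syntax(code) and check_correct(code):
                return code
            # Repair step: feed the rejected candidate back to the model.
            code = generate(f"Fix this code for task '{task}':\n{code}")
    return None
```

The framework's point is that knobs like `attempts` and `repairs` can be varied systematically to measure their effect on well-formedness and correctness rates per language.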
Related papers
- Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML [3.5515013986822073]
We present a case study conducted in collaboration with the leveling department at ASML. We investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment. The findings reveal that prompting techniques and model size have a significant impact on output quality.
arXiv Detail & Related papers (2025-09-15T19:39:26Z) - The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion [4.215010577170175]
We evaluate the confidence of Large Language Models (LLMs) when generating code by measuring code perplexity. We find that strongly-typed languages exhibit lower perplexity than dynamically typed languages. Perl appears universally high in perplexity, whereas Java appears low.
arXiv Detail & Related papers (2025-08-22T06:51:13Z) - IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - Can LLMs Replace Humans During Code Chunking? [2.4056836012742]
Large language models (LLMs) have become essential tools in computer science, especially for tasks involving code understanding and generation. This paper examines the application of LLMs in the modernization of legacy government code written in ALC and MUMPS.
arXiv Detail & Related papers (2025-06-24T13:02:35Z) - Type-Constrained Code Generation with Language Models [51.03439021895432]
We introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. Our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks.
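The core idea of constrained decoding, keeping only next tokens that leave the prefix extendable to a valid program, can be shown with a toy example. This is not the paper's prefix automata; it uses balanced parentheses as a hypothetical stand-in for well-typedness.

```python
def valid_prefix(s):
    """True if `s` can still be extended to a balanced-paren string.

    Stand-in for a well-typedness check: a prefix is viable as long as
    no close-paren appears without a matching open-paren before it.
    """
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return True

def constrained_step(prefix, candidates):
    """Filter the model's candidate tokens, keeping only viable continuations."""
    return [t for t in candidates if valid_prefix(prefix + t)]
```

At each decoding step, the mask is applied before sampling, so ill-formed outputs are never produced rather than filtered out afterwards.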
arXiv Detail & Related papers (2025-04-12T15:03:00Z) - CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings [32.72039589832989]
Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. These advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards. We propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains.
arXiv Detail & Related papers (2025-03-17T21:41:37Z) - Effective LLM-Driven Code Generation with Pythoness [0.0]
Pythoness is an embedded domain-specific language for code generation using large language models (LLMs). In Pythoness, developers operate at the level of behavioral specifications when writing functions, classes, or an entire program. We show that Pythoness can successfully leverage a combination of tests and code generation to yield higher quality code than specifications alone.
arXiv Detail & Related papers (2025-01-03T23:14:46Z) - Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
CodeGRAG builds the graphical view of code blocks based on their control flow and data flow to better interpret the programming domain knowledge. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.