Multi-IaC-Eval: Benchmarking Cloud Infrastructure as Code Across Multiple Formats
- URL: http://arxiv.org/abs/2509.05303v1
- Date: Thu, 21 Aug 2025 22:37:18 GMT
- Title: Multi-IaC-Eval: Benchmarking Cloud Infrastructure as Code Across Multiple Formats
- Authors: Sam Davidson, Li Sun, Bhavana Bhasker, Laurent Callot, Anoop Deoras
- Abstract summary: We present Multi-IaC-Bench, a novel benchmark dataset for evaluating LLM-based IaC generation and mutation. The dataset consists of triplets containing initial IaC templates, natural language modification requests, and corresponding updated templates. We evaluate several state-of-the-art LLMs on Multi-IaC-Bench, demonstrating that while modern LLMs can achieve high success rates (>95%) in generating syntactically valid IaC across formats, significant challenges remain in semantic alignment and handling complex infrastructure patterns.
- Score: 12.813627159588032
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Infrastructure as Code (IaC) is fundamental to modern cloud computing, enabling teams to define and manage infrastructure through machine-readable configuration files. However, different cloud service providers utilize diverse IaC formats. The lack of a standardized format requires cloud architects to be proficient in multiple IaC languages, adding complexity to cloud deployment. While Large Language Models (LLMs) show promise in automating IaC creation and maintenance, progress has been limited by the lack of comprehensive benchmarks across multiple IaC formats. We present Multi-IaC-Bench, a novel benchmark dataset for evaluating LLM-based IaC generation and mutation across AWS CloudFormation, Terraform, and Cloud Development Kit (CDK) formats. The dataset consists of triplets containing initial IaC templates, natural language modification requests, and corresponding updated templates, created through a synthetic data generation pipeline with rigorous validation. We evaluate several state-of-the-art LLMs on Multi-IaC-Bench, demonstrating that while modern LLMs can achieve high success rates (>95%) in generating syntactically valid IaC across formats, significant challenges remain in semantic alignment and handling complex infrastructure patterns. Our ablation studies highlight the importance of prompt engineering and retry mechanisms in successful IaC generation. We release Multi-IaC-Bench to facilitate further research in AI-assisted infrastructure management and establish standardized evaluation metrics for this crucial domain.
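The abstract describes the benchmark's core record: a triplet of initial IaC template, natural-language modification request, and updated template, with generated output checked for syntactic validity. As a purely illustrative sketch (not the paper's actual code, data format, or validation pipeline), a JSON-based CloudFormation-style triplet and a crude format-level check might look like:

```python
# Hypothetical Multi-IaC-Bench-style triplet: all field names and the
# example templates are assumptions for illustration, not from the paper.
import json

triplet = {
    "initial": json.dumps({
        "Resources": {
            "DataBucket": {"Type": "AWS::S3::Bucket", "Properties": {}}
        }
    }),
    "request": "Enable versioning on the S3 bucket.",
    "updated": json.dumps({
        "Resources": {
            "DataBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"}
                },
            }
        }
    }),
}

def is_syntactically_valid(template: str) -> bool:
    """Crude stand-in for the paper's validation step: the template must
    parse as JSON and declare at least one resource. Real CloudFormation
    validation (e.g. cfn-lint or the ValidateTemplate API) checks far more."""
    try:
        doc = json.loads(template)
    except json.JSONDecodeError:
        return False
    return isinstance(doc, dict) and bool(doc.get("Resources"))

assert is_syntactically_valid(triplet["initial"])   # parses, has Resources
assert is_syntactically_valid(triplet["updated"])   # parses, has Resources
assert not is_syntactically_valid("{not json")      # fails to parse
```

The key evaluation gap the paper highlights is that a check like this only establishes syntactic validity; semantic alignment (did the updated template actually satisfy the request?) requires comparing the modified resources against the intent of the natural-language instruction.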
Related papers
- Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation [48.88299242238335]
Cross-Modal Re-identification (CM-ReID) faces challenges from maintaining a fragmented ecosystem of specialized cloud models. We propose MLLMEmbed-ReID, a unified framework based on a powerful cloud-edge architecture.
arXiv Detail & Related papers (2026-02-13T13:48:08Z) - SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling [0.0]
We introduce SITS-DECO, a proof-of-concept generative model that applies unified-sequence framing to EO data. We show that the model can perform multiple supervised and self-supervised tasks within a single unified architecture. Despite its simplicity and lack of spatial context, SITS-DECO outperforms much larger EO foundation models on crop-type classification.
arXiv Detail & Related papers (2025-10-21T14:42:55Z) - LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding [55.5535016040221]
LM-Searcher is a novel framework for cross-domain neural architecture optimization. Central to our approach is NCode, a universal numerical string representation for neural architectures. Our dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning.
arXiv Detail & Related papers (2025-09-06T09:26:39Z) - OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation [91.45421429922506]
OneCAT is a unified multimodal model that seamlessly integrates understanding, generation, and editing. Our framework eliminates the need for external components such as Vision Transformers (ViT) or a vision tokenizer during inference.
arXiv Detail & Related papers (2025-09-03T17:29:50Z) - Speed Always Wins: A Survey on Efficient Architectures for Large Language Models [51.817121227562964]
Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and push the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional Transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment.
arXiv Detail & Related papers (2025-08-13T14:13:46Z) - JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation [3.6946337486060776]
JARVIS is a novel multi-agent framework that leverages Large Language Models (LLMs) and domain expertise to generate high-quality scripts for EDA tasks. By combining a domain-specific LLM trained with synthetically generated data, a custom compiler for structural verification, rule enforcement, code-fixing capabilities, and advanced retrieval mechanisms, our approach achieves significant improvements over state-of-the-art domain-specific models.
arXiv Detail & Related papers (2025-05-20T23:40:57Z) - OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions [58.46747176834132]
We present OmniEvalKit, a novel benchmarking toolbox designed to evaluate Large Language Models (LLMs). Unlike existing benchmarks that often focus on a single aspect, OmniEvalKit provides a modular, lightweight, and automated evaluation system. It is structured with a modular architecture comprising a Static Builder and Dynamic Data Flow, promoting the seamless integration of new models and datasets.
arXiv Detail & Related papers (2024-12-09T17:39:43Z) - A Survey of using Large Language Models for Generating Infrastructure as Code [3.514825979961616]
Infrastructure as Code (IaC) is a revolutionary approach that has gained prominence in industry.
We study the feasibility of applying Large Language Models (LLM) to address this problem.
arXiv Detail & Related papers (2024-03-30T02:57:55Z) - Statically Inferring Usage Bounds for Infrastructure as Code [0.9886108751871757]
Infrastructure as Code (IaC) has enabled cloud customers to have more agility in creating and modifying complex deployments of cloud-provisioned resources.
We propose a tool for fine-grained static usage analysis that works by modeling the inter-resource interactions in an IaC deployment as a set of constraints.
arXiv Detail & Related papers (2024-02-23T22:27:56Z) - Cloud-Device Collaborative Learning for Multimodal Large Language Models [24.65882336700547]
We introduce a Cloud-Device Collaborative Continual Adaptation framework to enhance the performance of compressed, device-deployed MLLMs.
Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment.
arXiv Detail & Related papers (2023-12-26T18:46:14Z) - Benchmarking Diverse-Modal Entity Linking with Generative Models [78.93737257356784]
We construct a benchmark for diverse-modal EL (DMEL) from existing EL datasets.
To approach the DMEL task, we proposed a generative diverse-modal model (GDMM) following a multimodal-encoder-decoder paradigm.
GDMM builds a stronger DMEL baseline, outperforming state-of-the-art task-specific EL models by 8.51 F1 score on average.
arXiv Detail & Related papers (2023-05-27T02:38:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.