CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation
- URL: http://arxiv.org/abs/2401.06786v1
- Date: Fri, 10 Nov 2023 01:49:57 GMT
- Title: CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation
- Authors: Yifei Xu, Yuning Chen, Xumiao Zhang, Xianshang Lin, Pan Hu, Yunfei Ma,
Songwu Lu, Wan Du, Zhuoqing Mao, Ennan Zhai, Dennis Cai
- Abstract summary: We present CloudEval-YAML, a practical benchmark for cloud configuration generation.
The dataset consists of 1011 hand-written problems with unit tests targeting practical scenarios, which took more than 1200 human hours to complete.
- Score: 9.320732264679238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Among the thriving ecosystem of cloud computing and the proliferation of
Large Language Model (LLM)-based code generation tools, there is a lack of
benchmarking for code generation in cloud-native applications. In response to
this need, we present CloudEval-YAML, a practical benchmark for cloud
configuration generation. CloudEval-YAML tackles the diversity challenge by
focusing on YAML, the de facto standard of numerous cloud-native tools. We
develop the CloudEval-YAML benchmark with practicality in mind: the dataset
consists of hand-written problems with unit tests targeting practical
scenarios. We further enhance the dataset to meet practical needs by
rephrasing questions in a concise, abbreviated, and bilingual manner. The
dataset consists of 1011 problems that take more than 1200 human hours to
complete. To improve practicality during evaluation, we build a scalable
evaluation platform for CloudEval-YAML that achieves a 20x speedup over a
single machine. To the best of our knowledge, the CloudEval-YAML dataset is the
first hand-written dataset targeting cloud-native applications. We present an
in-depth evaluation of 12 LLMs, leading to a deeper understanding of the
problems and LLMs, as well as effective methods to improve task performance and
reduce cost.
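To make the format concrete, a problem of this kind pairs a natural-language question with a unit-test script that checks the behavior of the generated YAML once it is applied to a live cluster. The following is a minimal sketch of such a problem and its scoring step, assuming a Kubernetes target and a kubectl-based workflow; the schema (question, unit_test), the example Deployment task, and the evaluate helper are illustrative assumptions for this sketch, not the benchmark's actual format.

```python
import subprocess
import tempfile

# A hypothetical CloudEval-YAML-style problem. The schema below is an
# illustrative assumption for this sketch, not the benchmark's real format.
problem = {
    "question": "Create a Deployment named nginx-deploy that runs nginx:1.25 "
                "with 2 replicas in the default namespace.",
    # The unit test probes live cluster state after the YAML is applied.
    "unit_test": "kubectl get deployment nginx-deploy "
                 "-o jsonpath='{.status.readyReplicas}' | grep -q '^2$'",
}

def evaluate(candidate_yaml: str, problem: dict, timeout: int = 120) -> bool:
    """Apply an LLM-generated YAML file and run the problem's unit test.

    Returns True only if the configuration applies cleanly and the
    functional check passes; any error or timeout counts as failure.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        f.write(candidate_yaml)
        path = f.name
    try:
        # Invalid or malformed YAML fails at the apply step.
        subprocess.run(["kubectl", "apply", "-f", path],
                       check=True, timeout=timeout)
        # The unit test inspects the resulting cluster state.
        result = subprocess.run(problem["unit_test"], shell=True,
                                timeout=timeout)
        return result.returncode == 0
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False
```

Because each evaluation exercises real cluster state, scoring is slow on one machine; the 20x speedup reported above can be pictured as running many such evaluate calls in parallel across isolated, disposable clusters.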
Related papers
- AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning [36.37717583840935]
We propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller locally deployed LLMs.
Our framework comprises two primary modules: the local agent instantiated with a relatively smaller LLM, and the cloud agent equipped with a larger LLM.
This collaborative processing is enabled through an adaptive mechanism where the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent.
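As a rough sketch of this mechanism (the confidence-threshold heuristic, the function signatures, and the escalation prompt below are illustrative assumptions, since the summary does not specify how errors are detected):

```python
from typing import Callable, Tuple

def adaptive_generate(
    query: str,
    small_llm: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
    large_llm: Callable[[str], str],
    threshold: float = 0.8,
) -> str:
    """Hypothetical local-first collaboration: answer locally, introspect,
    and escalate to the cloud agent only when an error is suspected."""
    answer, confidence = small_llm(query)   # cheap attempt on the local agent
    if confidence >= threshold:
        return answer                        # local agent trusts its answer
    # Suspected error: proactively seek assistance from the cloud agent,
    # handing over the draft so the larger model can correct it.
    return large_llm(f"Question: {query}\nDraft answer: {answer}\n"
                     "Revise the draft if it is wrong.")
```

The intended effect of this pattern is that the cloud model is invoked only on the hard residual cases, keeping average cost close to local-only inference.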
arXiv Detail & Related papers (2024-10-17T03:07:37Z) - Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance [0.0]
Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks.
Smaller language models (SLMs), which can be deployed on lower-cost edge devices, struggle to match the performance of their larger counterparts.
This paper presents a novel hybrid inference approach that leverages the strengths of both model types.
arXiv Detail & Related papers (2024-09-15T15:12:45Z) - Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval-augmented generation.
arXiv Detail & Related papers (2024-03-07T08:34:57Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - Accelerated Cloud for Artificial Intelligence (ACAI) [24.40451195277244]
We propose an end-to-end cloud-based machine learning platform, Accelerated Cloud for AI (ACAI).
ACAI enables cloud-based storage of indexed, labeled, and searchable data, as well as automatic resource provisioning, job scheduling, and experiment tracking.
We show that our auto-provisioner produces a 1.7x speed-up and 39% cost reduction, and our system reduces experiment time for ML scientists by 20% on typical ML use cases.
arXiv Detail & Related papers (2024-01-30T07:09:48Z) - Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected
Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
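Read as pseudocode, one refinement round might be organized as follows; every function and attribute name below is a placeholder inferred from the summary above, not the system's real API:

```python
# Hypothetical outline of one refinement round; all names are placeholders
# for the stages described in the summary, not an actual interface.
def data_engine_round(model, evaluate, gpt4_generate):
    report = evaluate(model)                      # score the current model
    # Adaptive Bad-case Sampling: weight each data type by its failure
    # rate so that weaker skills receive proportionally more new data.
    ratios = {dtype: fails / max(report.total, 1)
              for dtype, fails in report.failures_by_type.items()}
    # GPT-4 synthesizes high-quality training examples per targeted type.
    new_data = [example
                for dtype, ratio in ratios.items()
                for example in gpt4_generate(dtype, n=int(1000 * ratio))]
    return model.finetune(new_data)               # close the loop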
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - Scaling Data Science Solutions with Semantics and Machine Learning:
Bosch Case [8.445414390004636]
SemCloud is a semantics-enhanced cloud system that combines semantic technologies and machine learning.
The system has been evaluated in an industrial use case with millions of data records, thousands of repeated runs, and domain users, showing promising results.
arXiv Detail & Related papers (2023-08-02T11:58:30Z) - MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
Multimodal Large Language Models (MLLMs) rely on powerful LLMs to perform multimodal tasks.
This paper presents the first comprehensive MLLM Evaluation benchmark MME.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z) - Machine learning for cloud resources management -- An overview [0.0]
This study surveys the most important cloud resource management problems that have been addressed with machine learning.
A large collection of studies is used to draw meaningful comparisons between the ML techniques applied in the different areas of cloud resource management.
We propose the most suitable ML model for each area.
arXiv Detail & Related papers (2021-01-28T13:23:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.