Related papers: Automating Code Adaptation for MLOps -- A Benchmarking Study on LLMs

Automating Code Adaptation for MLOps -- A Benchmarking Study on LLMs

URL: http://arxiv.org/abs/2405.06835v1
Date: Fri, 10 May 2024 22:18:43 GMT
Title: Automating Code Adaptation for MLOps -- A Benchmarking Study on LLMs
Authors: Harsh Patel, Buvaneswari A. Ramanan, Manzoor A. Khan, Thomas Williams, Brian Friedman, Lawrence Drabeck,
Abstract summary: We evaluate the performance of OpenAI (gpt-3.5-turbo) and WizardCoder (open-source, 15B parameters) models on the automated accomplishment of various MLOps functionalities.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: This paper explores the possibilities of the current generation of Large Language Models for incorporating Machine Learning Operations (MLOps) functionalities into ML training code bases. We evaluate the performance of OpenAI (gpt-3.5-turbo) and WizardCoder (open-source, 15B parameters) models on the automated accomplishment of various MLOps functionalities in different settings. We perform a benchmarking study that assesses the ability of these models to: (1) adapt existing code samples (Inlining) with component-specific MLOps functionality such as MLflow and Weights & Biases for experiment tracking, Optuna for hyperparameter optimization etc., and (2) perform the task of Translation from one component of an MLOps functionality to another, e.g., translating existing GitPython library based version control code to Data Version Control library based. We also propose three different approaches that involve teaching LLMs to comprehend the API documentation of the components as a reference while accomplishing the Translation tasks. In our evaluations, the gpt-3.5-turbo model significantly outperforms WizardCoder by achieving impressive Pass@3 accuracy in model optimization (55% compared to 0% by WizardCoder), experiment tracking (100%, compared to 62.5% by WizardCoder), model registration (92% compared to 42% by WizardCoder) and hyperparameter optimization (83% compared to 58% by WizardCoder) on average, in their best possible settings, showcasing its superior code adaptability performance in complex MLOps tasks.

Related papers

Function-to-Style Guidance of LLMs for Code Translation [59.487054943812836]
We propose F2STrans, a function-to-style guiding paradigm designed to improve the performance of large language models in code translation.<n>Our approach comprises two key stages: (1) Functional learning, which optimize translation correctness using high-quality source-target code pairs.<n>We introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations.
arXiv Detail & Related papers (2025-07-15T08:25:02Z)
CODEMENV: Benchmarking Large Language Models on Code Migration [11.735053997817765]
CODEMENV consists of 922 examples spanning 19 Python and Java packages.<n>It covers three core tasks: identifying functions incompatible with specific versions, detecting changes in function definitions, and adapting code to target environments.<n> Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4O achieving the highest score at 43.84%.
arXiv Detail & Related papers (2025-06-01T08:29:59Z)
Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach [6.289275189295223]
We investigate the relationship between code complexity and the success of Large Language Models generated code.<n>We propose an iterative feedback method, where LLMs are prompted to generate correct code based on complexity metrics from previous failed outputs.<n>Experiment results show that our approach makes notable improvements, particularly with a smaller LLM.
arXiv Detail & Related papers (2025-05-29T19:06:14Z)
Improving Assembly Code Performance with Large Language Models via Reinforcement Learning [9.20863636863631]
Large language models (LLMs) have demonstrated strong performance across a wide range of programming tasks.<n>We present a reinforcement learning framework that trains LLMs using Proximal Policy Optimization (PPO)<n>Our model, Qwen2.5-Coder-7B-PPO, achieves 96.4% test pass rates and an average speedup of 1.47x over the gcc -O3 baseline.
arXiv Detail & Related papers (2025-05-16T17:40:45Z)
Large Language Models as Code Executors: An Exploratory Study [29.545321608864295]
This paper pioneers the exploration of Large Language Models (LLMs) as code executors. We are the first to examine this feasibility across various LLMs, including OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder. We introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22%.
arXiv Detail & Related papers (2024-10-09T08:23:22Z)
Harnessing LLMs for API Interactions: A Framework for Classification and Synthetic Data Generation [0.0]
We propose a novel system that integrates Large Language Models (LLMs) for both classifying natural language inputs into corresponding API calls. Our system allows users to invoke complex software functionalities through simple inputs, improving interaction efficiency and lowering the barrier to software utilization.
arXiv Detail & Related papers (2024-09-18T04:56:52Z)
SpecTra: Enhancing the Code Translation Ability of Language Models by Generating Multi-Modal Specifications [17.60108067953814]
Large language models (LLMs) are increasingly being used for the task of automated code translation. We propose SpecTra, a multi-stage approach that uses a novel self-consistency filter to first generate high-quality specifications.
arXiv Detail & Related papers (2024-05-28T20:48:30Z)
Towards Modular LLMs by Building and Reusing a Library of LoRAs [64.43376695346538]
We study how to best build a library of adapters given multi-task data. We introduce model-based clustering, MBC, a method that groups tasks based on the similarity of their adapter parameters. To re-use the library, we present a novel zero-shot routing mechanism, Arrow, which enables dynamic selection of the most relevant adapters.
arXiv Detail & Related papers (2024-05-18T03:02:23Z)
SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents. We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z)
MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks. MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z)
ACPO: AI-Enabled Compiler Framework [1.752593459729982]
This paper presents ACPO: An AI-Enabled Compiler Framework. It provides LLVM with simple and comprehensive tools to benefit from employing ML models for different optimization passes. We show that ACPO can provide a combined speedup of 4.5% on Polybench and 2.4% on Cbench when compared with LLVM's O3.
arXiv Detail & Related papers (2023-12-15T17:49:24Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation [6.525197444717069]
GEVO-ML is a tool for discovering optimization opportunities and tuning the performance of Machine Learning kernels. We demonstrate GEVO-ML on two different ML workloads for both model training and prediction. GEVO-ML finds significant improvements for these models, achieving 90.43% performance improvement when model accuracy is relaxed by 2%.
arXiv Detail & Related papers (2023-10-16T09:24:20Z)
MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results. For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data. For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
MLGOPerf: An ML Guided Inliner to Optimize Performance [7.314201117946244]
This paper presents the first end-to-end framework capable of optimizing performance using LLVM's ML-Inliner. It employs a secondary ML model to generate rewards used for training a retargeted Reinforcement learning agent. It does so by predicting the post-inlining speedup of a function under analysis and it enables a fast training framework for the primary model.
arXiv Detail & Related papers (2022-07-18T05:47:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.