Related papers: On the effectiveness of Large Language Models for GitHub Workflows

On the effectiveness of Large Language Models for GitHub Workflows

URL: http://arxiv.org/abs/2403.12446v1
Date: Tue, 19 Mar 2024 05:14:12 GMT
Title: On the effectiveness of Large Language Models for GitHub Workflows
Authors: Xinyu Zhang, Siddharth Muralee, Sourag Cherupattamoolayil, Aravind Machiry,
Abstract summary: Large Language Models (LLMs) have demonstrated their effectiveness in various software development tasks. We perform the first comprehensive study to understand the effectiveness of LLMs on five workflow-related tasks with different levels of prompts. Our evaluation of three state-of-art LLMs and their fine-tuned variants revealed various interesting findings on the current effectiveness and drawbacks of LLMs.
Score: 9.82254417875841
License: http://creativecommons.org/licenses/by/4.0/
Abstract: GitHub workflows or GitHub CI is a popular continuous integration platform that enables developers to automate various software engineering tasks by specifying them as workflows, i.e., YAML files with a list of jobs. However, engineering valid workflows is tedious. They are also prone to severe security issues, which can result in supply chain vulnerabilities. Recent advancements in Large Language Models (LLMs) have demonstrated their effectiveness in various software development tasks. However, GitHub workflows differ from regular programs in both structure and semantics. We perform the first comprehensive study to understand the effectiveness of LLMs on five workflow-related tasks with different levels of prompts. We curated a set of $\sim$400K workflows and generated prompts with varying detail. We also fine-tuned LLMs on GitHub workflow tasks. Our evaluation of three state-of-the-art LLMs and their fine-tuned variants revealed various interesting findings on the current effectiveness and drawbacks of LLMs.

Related papers

SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z)
WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [105.46456444315693]
We presentLLM, a data-centric framework to enhance the capability of large language models in workflow orchestration. It first constructs a large-scale fine-tuningBench with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories. LlamaLlama demonstrates a strong capacity to orchestrate complex APIs, while also achieving notable generalization performance.
arXiv Detail & Related papers (2024-11-08T09:58:02Z)
Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms. We observe that the generated can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
Automatic Categorization of GitHub Actions with Transformers and Few-shot Learning [12.254055731378045]
GitHub Actions (GHA) have been conceived to provide developers with a practical tool to create and maintain a pipeline. To expose actions to search engines, GitHub allows developers to assign them to one or more categories manually. We propose Gavel, a practical solution to increasing the visibility of actions in GitHub.
arXiv Detail & Related papers (2024-07-24T02:27:36Z)
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering. Spider2-V features real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
Automated DevOps Pipeline Generation for Code Repositories using Large Language Models [5.011328607647701]
The research scrutinizes the proficiency of GPT 3.5 and GPT 4 in generating GitHub, while assessing the influence of various prompt elements in constructing the most efficient pipeline. Results indicate substantial advancements in GPT 4. The research introduces a GitHub App built on Probot, empowering users to automate workflow generation within GitHub ecosystem.
arXiv Detail & Related papers (2023-12-20T17:47:52Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE [83.00018517368973]
Large Language Models (LLMs) can extend their zero-shot capabilities to multimodal learning through instruction tuning. negative conflicts and interference may have a worse impact on performance. We combine the well-known Mixture-of-Experts (MoE) and one of the representative PEFT techniques, i.e., LoRA, designing a novel LLM-based decoder, called LoRA-MoE, for multimodal learning.
arXiv Detail & Related papers (2023-11-05T15:48:29Z)
AskIt: Unified Programming Interface for Programming with Large Language Models [0.0]
Large Language Models (LLMs) exhibit a unique phenomenon known as emergent abilities, demonstrating adeptness across numerous tasks. This paper introduces AskIt, a domain-specific language specifically designed for LLMs. Across 50 tasks, AskIt generated concise prompts, achieving a 16.14 % reduction in prompt length compared to benchmarks.
arXiv Detail & Related papers (2023-08-29T21:44:27Z)
Low-code LLM: Graphical User Interface over Large Language Models [115.08718239772107]
This paper introduces a novel human-LLM interaction framework, Low-code LLM. It incorporates six types of simple low-code visual programming interactions to achieve more controllable and stable responses. We highlight three advantages of the low-code LLM: user-friendly interaction, controllable generation, and wide applicability.
arXiv Detail & Related papers (2023-04-17T09:27:40Z)
A Preliminary Investigation of MLOps Practices in GitHub [10.190501703364234]
Machine learning applications have led to an increasing interest in MLOps. We present an initial investigation of the MLOps practices implemented in a set of ML-enabled systems retrieved from GitHub.
arXiv Detail & Related papers (2022-09-23T07:29:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.