Prompt Engineering or Fine Tuning: An Empirical Assessment of Large
Language Models in Automated Software Engineering Tasks
- URL: http://arxiv.org/abs/2310.10508v1
- Date: Wed, 11 Oct 2023 00:21:00 GMT
- Title: Prompt Engineering or Fine Tuning: An Empirical Assessment of Large
Language Models in Automated Software Engineering Tasks
- Authors: Jiho Shin, Clark Tang, Tahmineh Mohati, Maleknaz Nayebi, Song Wang,
Hadi Hemmati
- Abstract summary: GPT-4 with conversational prompts showed drastic improvement compared to GPT-4 with automatic prompting strategies.
fully automated prompt engineering with no human in the loop requires more study and improvement.
- Score: 8.223311621898983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the effectiveness of state-of-the-art LLM,
i.e., GPT-4, with three different prompting engineering techniques (i.e., basic
prompting, in-context learning, and task-specific prompting) against 18
fine-tuned LLMs on three typical ASE tasks, i.e., code generation, code
summarization, and code translation. Our quantitative analysis of these
prompting strategies suggests that prompt engineering GPT-4 cannot necessarily
and significantly outperform fine-tuning smaller/older LLMs in all three tasks.
For comment generation, GPT-4 with the best prompting strategy (i.e.,
task-specific prompt) had outperformed the first-ranked fine-tuned model by
8.33% points on average in BLEU. However, for code generation, the first-ranked
fine-tuned model outperforms GPT-4 with best prompting by 16.61% and 28.3%
points, on average in BLEU. For code translation, GPT-4 and fine-tuned
baselines tie as they outperform each other on different translation tasks. To
explore the impact of different prompting strategies, we conducted a user study
with 27 graduate students and 10 industry practitioners. From our qualitative
analysis, we find that the GPT-4 with conversational prompts (i.e., when a
human provides feedback and instructions back and forth with a model to achieve
best results) showed drastic improvement compared to GPT-4 with automatic
prompting strategies. Moreover, we observe that participants tend to request
improvements, add more context, or give specific instructions as conversational
prompts, which goes beyond typical and generic prompting strategies. Our study
suggests that, at its current state, GPT-4 with conversational prompting has
great potential for ASE tasks, but fully automated prompt engineering with no
human in the loop requires more study and improvement.
Related papers
- Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z) - JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models [110.45794710162241]
Existing work either collects large-scale math-related texts for pre-training, or relies on stronger LLMs to synthesize massive math problems.
We propose an efficient way that trains a small LLM for math problem synthesis, to efficiently generate sufficient high-quality pre-training data.
We leverage it to synthesize 6 million math problems for pre-training our JiuZhang3.0 model, which only needs to invoke GPT-4 API 9.3k times and pre-train on 4.6B data.
arXiv Detail & Related papers (2024-05-23T09:43:19Z) - Feedback-Generation for Programming Exercises With GPT-4 [0.0]
This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
arXiv Detail & Related papers (2024-03-07T12:37:52Z) - Fine-Tuning and Prompt Engineering for Large Language Models-based Code Review Automation [4.941630596191807]
Fine-tuning and prompting are common approaches to leveraging Large Language Models (LLMs) for code review automation.
We use model fine-tuning and inference techniques (i.e., zero-shot learning, few-shot learning and persona) on LLMs-based code review automation.
Our results show that GPT-3.5 with zero-shot learning achieves 73.17% -74.23% higher EM than the Guo et al.'s approach.
arXiv Detail & Related papers (2024-02-01T03:10:26Z) - TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the amazing power of language models (LLMs) to solve our task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z) - Automated DevOps Pipeline Generation for Code Repositories using Large
Language Models [5.011328607647701]
The research scrutinizes the proficiency of GPT 3.5 and GPT 4 in generating GitHub, while assessing the influence of various prompt elements in constructing the most efficient pipeline.
Results indicate substantial advancements in GPT 4.
The research introduces a GitHub App built on Probot, empowering users to automate workflow generation within GitHub ecosystem.
arXiv Detail & Related papers (2023-12-20T17:47:52Z) - Code Soliloquies for Accurate Calculations in Large Language Models [22.1024285108075]
High-quality conversational datasets are crucial for the successful development of Intelligent Tutoring Systems.
These datasets are generated using advanced GPT-4 models.
Our design orchestrates a mock conversation where both student and tutorbot roles are simulated by GPT-4.
Our approach notably enhances the quality of synthetic conversation datasets, especially for subjects that are calculation-intensive.
arXiv Detail & Related papers (2023-09-21T15:16:58Z) - Generalized Planning in PDDL Domains with Pretrained Large Language
Models [82.24479434984426]
We consider PDDL domains and use GPT-4 to synthesize Python programs.
We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines.
arXiv Detail & Related papers (2023-05-18T14:48:20Z) - AutoML-GPT: Automatic Machine Learning with GPT [74.30699827690596]
We propose developing task-oriented prompts and automatically utilizing large language models (LLMs) to automate the training pipeline.
We present the AutoML-GPT, which employs GPT as the bridge to diverse AI models and dynamically trains models with optimized hyper parameters.
This approach achieves remarkable results in computer vision, natural language processing, and other challenging areas.
arXiv Detail & Related papers (2023-05-04T02:09:43Z) - Reframing Instructional Prompts to GPTk's Language [72.69833640335519]
We propose reframing techniques for model designers to create effective prompts for language models.
Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity.
The performance gains are particularly important on large language models, such as GPT3 where tuning models or prompts on large datasets is not feasible.
arXiv Detail & Related papers (2021-09-16T09:44:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.