Prompt Engineering or Fine Tuning: An Empirical Assessment of Large
Language Models in Automated Software Engineering Tasks
- URL: http://arxiv.org/abs/2310.10508v1
- Date: Wed, 11 Oct 2023 00:21:00 GMT
- Title: Prompt Engineering or Fine Tuning: An Empirical Assessment of Large
Language Models in Automated Software Engineering Tasks
- Authors: Jiho Shin, Clark Tang, Tahmineh Mohati, Maleknaz Nayebi, Song Wang,
Hadi Hemmati
- Abstract summary: GPT-4 with conversational prompts showed a drastic improvement compared to GPT-4 with automatic prompting strategies.
Fully automated prompt engineering with no human in the loop requires more study and improvement.
- Score: 8.223311621898983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the effectiveness of a state-of-the-art LLM,
i.e., GPT-4, with three different prompt engineering techniques (i.e., basic
prompting, in-context learning, and task-specific prompting) against 18
fine-tuned LLMs on three typical ASE tasks, i.e., code generation, code
summarization, and code translation. Our quantitative analysis of these
prompting strategies suggests that prompt-engineered GPT-4 does not necessarily
or significantly outperform fine-tuned smaller/older LLMs in all three tasks.
For comment generation, GPT-4 with the best prompting strategy (i.e., the
task-specific prompt) outperformed the first-ranked fine-tuned model by
8.33 percentage points on average in BLEU. However, for code generation, the first-ranked
fine-tuned model outperforms GPT-4 with the best prompting strategy by 16.61 and 28.3
percentage points on average in BLEU. For code translation, GPT-4 and the fine-tuned
baselines tie, each outperforming the other on different translation tasks. To
explore the impact of different prompting strategies, we conducted a user study
with 27 graduate students and 10 industry practitioners. From our qualitative
analysis, we find that GPT-4 with conversational prompts (i.e., when a
human provides feedback and instructions back and forth with the model to achieve
the best results) showed a drastic improvement compared to GPT-4 with automatic
prompting strategies. Moreover, we observe that participants tend to request
improvements, add more context, or give specific instructions as conversational
prompts, which goes beyond typical and generic prompting strategies. Our study
suggests that, in its current state, GPT-4 with conversational prompting has
great potential for ASE tasks, but fully automated prompt engineering with no
human in the loop requires more study and improvement.
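To make the three automatic prompting strategies concrete, below is a minimal Python sketch of how they might be applied to the code summarization task, followed by a sentence-level BLEU comparison. The prompt wording, the "gpt-4" model name, the OpenAI and NLTK usage, and the toy snippet and reference summary are illustrative assumptions, not the authors' actual experimental setup.

```python
# Minimal sketch: basic, in-context, and task-specific prompting for code
# summarization, scored with sentence-level BLEU. All prompts and data here
# are hypothetical; the paper evaluates on full benchmarks with averaged BLEU.
from openai import OpenAI  # assumes the openai>=1.0 Python client
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CODE_SNIPPET = "def add(a, b):\n    return a + b"

def ask(prompt: str) -> str:
    """Send a single user prompt to the chat model and return its reply."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

# 1) Basic prompting: state the task with no extra guidance.
basic = ask(f"Summarize the following code:\n{CODE_SNIPPET}")

# 2) In-context learning: prepend one or more input/output examples.
in_context = ask(
    "Code: def mul(a, b): return a * b\n"
    "Summary: Multiplies two numbers and returns the product.\n\n"
    f"Code: {CODE_SNIPPET}\n"
    "Summary:"
)

# 3) Task-specific prompting: frame the task and constrain the output format.
task_specific = ask(
    "Write a one-sentence docstring-style comment for the following Python "
    f"function, as a developer would:\n{CODE_SNIPPET}\nComment:"
)

# Compare each candidate against a reference comment with sentence-level BLEU
# (a per-sample illustration; the paper reports BLEU averaged over datasets).
reference = "Adds two numbers and returns the sum.".split()
smooth = SmoothingFunction().method1
for name, candidate in [("basic", basic), ("in-context", in_context),
                        ("task-specific", task_specific)]:
    score = sentence_bleu([reference], candidate.split(), smoothing_function=smooth)
    print(f"{name}: BLEU = {score:.3f}")
```

The conversational prompting condition from the user study would amount to wrapping the `ask` call in a loop that keeps the full `messages` history and appends each round of human feedback and instructions before re-querying the model.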
Related papers
- Automatic Generation of Question Hints for Mathematics Problems using Large Language Models in Educational Technology [17.91379291654773]
This work explores using Large Language Models (LLMs) as teachers to generate effective hints for students simulated through LLMs.
The results show that model errors increase with higher temperature settings.
Interestingly, Llama-3-8B-Instruct as a teacher showed better overall performance than GPT-4o.
arXiv Detail & Related papers (2024-11-05T20:18:53Z)
- Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions [2.0411082897313984]
This study investigates how LLMs, specifically GPT-3.5 and GPT-4, can develop tailored questions for Grade 9 math.
By utilizing an iterative method, these models adjust questions based on difficulty and content, responding to feedback from a simulated 'student' model.
arXiv Detail & Related papers (2024-06-20T00:25:43Z)
- Feedback-Generation for Programming Exercises With GPT-4 [0.0]
This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
arXiv Detail & Related papers (2024-03-07T12:37:52Z)
- Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies [47.129504708849446]
Large Language Models (LLMs) have revolutionized the field of Natural Language Processing.
However, LLMs lack systematic generalization, which would allow them to extrapolate learned statistical regularities outside the training distribution.
In this work, we offer a systematic benchmarking of GPT-4, one of the most advanced LLMs available.
arXiv Detail & Related papers (2024-02-27T10:44:52Z)
- Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance [11.595274304409937]
Large language models (LLMs) have revolutionized zero-shot task performance.
Current methods using trigger phrases such as "Let's think step by step" remain limited.
This study introduces PRomPTed, an approach that optimizes zero-shot prompts for individual task instances.
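As a rough illustration of that idea, here is a minimal sketch of instance-level prompt rewriting with an LLM in the loop: a second model call tailors the generic zero-shot prompt to the specific input before the task is solved. The rewriting instruction, the "gpt-4" model name, and the example task are assumptions for illustration, not the PRomPTed implementation.

```python
# Hypothetical sketch of instance-level prompt rewriting (not the paper's code).
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()

def chat(prompt: str) -> str:
    """One-shot call to the chat model; returns the reply text."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

def answer_with_rewritten_prompt(task_prompt: str, instance: str) -> str:
    # Step 1 (LLM in the loop): tailor the generic zero-shot prompt to this input.
    rewritten = chat(
        "Rewrite the following task prompt so it is as clear and specific as "
        f"possible for the given input.\nTask prompt: {task_prompt}\n"
        f"Input: {instance}\nReturn only the rewritten prompt."
    )
    # Step 2: solve the instance with the instance-specific prompt.
    return chat(f"{rewritten}\n\nInput: {instance}")

print(answer_with_rewritten_prompt(
    "Classify the sentiment of the text as positive or negative.",
    "The build finally passes, but the test suite still flakes constantly.",
))
```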
arXiv Detail & Related papers (2023-10-03T14:51:34Z)
- GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
- Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z)
- TEMPERA: Test-Time Prompting via Reinforcement Learning [57.48657629588436]
We propose Test-time Prompt Editing using Reinforcement learning (TEMPERA).
In contrast to prior prompt generation methods, TEMPERA can efficiently leverage prior knowledge.
Our method achieves an average 5.33x improvement in sample efficiency compared to traditional fine-tuning methods.
arXiv Detail & Related papers (2022-11-21T22:38:20Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- Reframing Instructional Prompts to GPTk's Language [72.69833640335519]
We propose reframing techniques for model designers to create effective prompts for language models.
Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity.
The performance gains are particularly important on large language models, such as GPT-3, where tuning models or prompts on large datasets is not feasible.
arXiv Detail & Related papers (2021-09-16T09:44:43Z)