Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code
- URL: http://arxiv.org/abs/2310.10508v2
- Date: Wed, 19 Feb 2025 22:37:08 GMT
- Title: Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code
- Authors: Jiho Shin, Clark Tang, Tahmineh Mohati, Maleknaz Nayebi, Song Wang, Hadi Hemmati
- Abstract summary: We evaluate GPT-4 using three prompt engineering strategies -- basic prompting, in-context learning, and task-specific prompting.
We compare it against 17 fine-tuned models across three code-related tasks: code summarization, generation, and translation.
- Score: 7.760653867600283
- License:
- Abstract: The rapid advancements in large language models (LLMs) have greatly expanded the potential for automated code-related tasks. Two primary methodologies are used in this domain: prompt engineering and fine-tuning. Prompt engineering involves applying different strategies to query LLMs, like ChatGPT, while fine-tuning further adapts pre-trained models, such as CodeBERT, by training them on task-specific data. Despite the growth in the area, there remains a lack of comprehensive comparative analysis between the two approaches for code models. In this paper, we evaluate GPT-4 using three prompt engineering strategies -- basic prompting, in-context learning, and task-specific prompting -- and compare it against 17 fine-tuned models across three code-related tasks: code summarization, generation, and translation. Our results indicate that GPT-4 with prompt engineering does not consistently outperform fine-tuned models. For instance, in code generation, GPT-4 is outperformed by fine-tuned models by 28.3 percentage points on the MBPP dataset, and it shows mixed results for code translation tasks. Additionally, a user study involving 27 graduate students and 10 industry practitioners revealed that GPT-4 with conversational prompts, which incorporate human feedback during interaction, significantly improved performance compared to automated prompting; participants often provided explicit instructions or added context during these interactions. These findings suggest that GPT-4 with conversational prompting holds significant promise for automated code-related tasks, whereas fully automated prompt engineering without human involvement still requires further investigation.
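To make the three strategies concrete, here is a minimal sketch of how such prompts might be constructed for code summarization. The templates are illustrative assumptions, not the paper's exact prompts:

```python
# Illustrative sketches of the three prompting strategies for code
# summarization. These templates are assumptions for illustration,
# not the paper's exact prompts.

SNIPPET = "def add(a, b):\n    return a + b"

# 1. Basic prompting: a single direct instruction, no examples.
basic = f"Summarize the following Python function:\n{SNIPPET}"

# 2. In-context learning: prepend input/output demonstrations.
demos = (
    "Code: def square(x): return x * x\n"
    "Summary: Returns the square of x.\n\n"
)
in_context = demos + f"Code: {SNIPPET}\nSummary:"

# 3. Task-specific prompting: add task-aware guidance and constraints.
task_specific = (
    "You are an expert Python developer. Write a one-sentence, "
    f"docstring-style summary of this function:\n{SNIPPET}"
)

for name, prompt in [("basic", basic), ("in-context", in_context),
                     ("task-specific", task_specific)]:
    print(f"--- {name} ---\n{prompt}\n")
```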
Related papers
- Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z)
- JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models [110.45794710162241]
Existing work either collects large-scale math-related texts for pre-training or relies on stronger LLMs to synthesize massive math problems.
We propose an efficient alternative: training a small LLM for math problem synthesis to generate sufficient high-quality pre-training data.
We leverage it to synthesize 6 million math problems for pre-training our JiuZhang3.0 model, which requires only 9.3k GPT-4 API invocations and pre-training on 4.6B tokens of data.
arXiv Detail & Related papers (2024-05-23T09:43:19Z)
- Feedback-Generation for Programming Exercises With GPT-4 [0.0]
This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
arXiv Detail & Related papers (2024-03-07T12:37:52Z)
- Fine-Tuning and Prompt Engineering for Large Language Models-based Code Review Automation [4.941630596191807]
Fine-tuning and prompting are common approaches to leveraging Large Language Models (LLMs) for code review automation.
We apply model fine-tuning and inference techniques (i.e., zero-shot learning, few-shot learning, and persona prompting) to LLM-based code review automation.
Our results show that GPT-3.5 with zero-shot learning achieves 73.17%-74.23% higher Exact Match (EM) than Guo et al.'s approach.
arXiv Detail & Related papers (2024-02-01T03:10:26Z)
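For intuition, here is a minimal sketch of how the three inference settings might frame the same review request. The diff and templates are hypothetical illustrations, not the paper's actual prompts:

```python
# Illustrative prompt framings for LLM-based code review automation.
# The diff and templates below are assumptions for illustration only.

DIFF = "-    if user == None:\n+    if user is None:"

# Zero-shot learning: ask directly, with no examples.
zero_shot = f"Review this code change and suggest improvements:\n{DIFF}"

# Few-shot learning: prepend worked review examples to the query.
few_shot = (
    "Change: -x = x + 1  +x += 1\n"
    "Review: Prefer the augmented assignment; behavior is unchanged.\n\n"
    f"Change: {DIFF}\nReview:"
)

# Persona: assign the model a reviewer role to steer tone and focus.
persona = (
    "You are a senior software engineer performing a code review.\n"
    f"Review this change and suggest improvements:\n{DIFF}"
)

print(zero_shot, few_shot, persona, sep="\n\n")
```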
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the power of large language models (LLMs) to solve this task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
- Automated DevOps Pipeline Generation for Code Repositories using Large Language Models [5.011328607647701]
The research scrutinizes the proficiency of GPT-3.5 and GPT-4 in generating GitHub workflows, while assessing the influence of various prompt elements in constructing the most efficient pipeline.
Results indicate substantial advancements with GPT-4.
The research introduces a GitHub App built on Probot, empowering users to automate workflow generation within the GitHub ecosystem.
arXiv Detail & Related papers (2023-12-20T17:47:52Z)
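As a rough illustration of this kind of pipeline generation (a hypothetical prompt template, not the paper's), one might assemble repository metadata into a single workflow-generation query:

```python
# Hypothetical prompt assembly for LLM-based GitHub workflow generation.
# The repository metadata and template are illustrative assumptions.

repo = {
    "language": "Python",
    "package_manager": "pip",
    "test_command": "pytest",
}

prompt = (
    "Generate a GitHub Actions workflow (YAML) for a "
    f"{repo['language']} repository that installs dependencies with "
    f"{repo['package_manager']} and runs tests with {repo['test_command']}. "
    "Trigger the workflow on every push and pull request."
)
print(prompt)
```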
- Code Soliloquies for Accurate Calculations in Large Language Models [22.1024285108075]
High-quality conversational datasets are crucial for the successful development of Intelligent Tutoring Systems.
These datasets are generated using advanced GPT-4 models.
Our design orchestrates a mock conversation where both student and tutorbot roles are simulated by GPT-4.
Our approach notably enhances the quality of synthetic conversation datasets, especially for subjects that are calculation-intensive.
arXiv Detail & Related papers (2023-09-21T15:16:58Z)
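To illustrate the idea of a mock two-role conversation, here is a minimal self-play sketch; the `generate` function is a hypothetical stand-in for a GPT-4 call, not the paper's implementation:

```python
# Minimal self-play sketch: one model alternates between student and
# tutorbot roles to synthesize a tutoring conversation. `generate` is a
# hypothetical stand-in for a GPT-4 call; it returns canned turns here.

def generate(role: str, history: list[str]) -> str:
    canned = {
        "student": "How much energy does a 2 kg mass at 3 m height have?",
        "tutorbot": "Using E = m * g * h: 2 * 9.8 * 3 = 58.8 joules.",
    }
    return canned[role]

history: list[str] = []
for turn in range(2):
    role = "student" if turn % 2 == 0 else "tutorbot"
    reply = generate(role, history)
    history.append(f"{role}: {reply}")

print("\n".join(history))
```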
- Generalized Planning in PDDL Domains with Pretrained Large Language Models [82.24479434984426]
We consider PDDL domains and use GPT-4 to synthesize Python programs.
We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines.
arXiv Detail & Related papers (2023-05-18T14:48:20Z)
- AutoML-GPT: Automatic Machine Learning with GPT [74.30699827690596]
We propose developing task-oriented prompts and automatically utilizing large language models (LLMs) to automate the training pipeline.
We present AutoML-GPT, which employs GPT as a bridge to diverse AI models and dynamically trains models with optimized hyperparameters.
This approach achieves remarkable results in computer vision, natural language processing, and other challenging areas.
arXiv Detail & Related papers (2023-05-04T02:09:43Z)
- Reframing Instructional Prompts to GPTk's Language [72.69833640335519]
We propose reframing techniques for model designers to create effective prompts for language models.
Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity.
The performance gains are particularly important for large language models such as GPT-3, where tuning models or prompts on large datasets is not feasible.
arXiv Detail & Related papers (2021-09-16T09:44:43Z)
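For a flavor of what reframing can look like (a hypothetical before/after pair in the spirit of the paper, not its exact prompts), compare a verbose instruction with an itemized, constraint-focused rewrite:

```python
# Hypothetical before/after reframing of an instructional prompt.
# The task and wording are illustrative, not taken from the paper.

original = (
    "Read the passage below, think about what it mainly discusses, and "
    "then, keeping in mind that the answer should be short and should not "
    "copy long spans verbatim, produce a title for the passage."
)

# Reframed: decomposed into short, itemized, low-level instructions.
reframed = (
    "Generate a title for the passage below.\n"
    "- Keep the title under 10 words.\n"
    "- Do not copy a full sentence from the passage.\n"
    "- Mention the main topic."
)

print(original, reframed, sep="\n\n")
```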
This list is automatically generated from the titles and abstracts of the papers on this site.