Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4,
and Human Tutors
- URL: http://arxiv.org/abs/2306.17156v3
- Date: Tue, 1 Aug 2023 00:03:25 GMT
- Title: Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4,
and Human Tutors
- Authors: Tung Phung, Victor-Alexandru P\u{a}durean, Jos\'e Cambronero, Sumit
Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, Gustavo Soares
- Abstract summary: We systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios.
Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios.
- Score: 21.227955181065948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative AI and large language models hold great promise in enhancing
computing education by powering next-generation educational technologies for
introductory programming. Recent works have studied these models for different
scenarios relevant to programming education; however, these works are limited
for several reasons, as they typically consider already outdated models or only
specific scenario(s). Consequently, there is a lack of a systematic study that
benchmarks state-of-the-art models for a comprehensive set of programming
education scenarios. In our work, we systematically evaluate two models,
ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human
tutors for a variety of scenarios. We evaluate using five introductory Python
programming problems and real-world buggy programs from an online platform, and
assess performance using expert-based annotations. Our results show that GPT-4
drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human
tutors' performance for several scenarios. These results also highlight
settings where GPT-4 still struggles, providing exciting future directions on
developing techniques to improve the performance of these models.
Related papers
- Comparing large language models and human programmers for generating programming code [0.0]
GPT-4 substantially outperforms other large language models, including Gemini Ultra and Claude 2.
In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 employing the optimal prompt strategy outperforms 85 percent of human participants.
arXiv Detail & Related papers (2024-03-01T14:43:06Z) - Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation [25.317788211120362]
We investigate the role of generative AI models in providing human tutor-style programming hints.
Recent works have benchmarked state-of-the-art models for various feedback generation scenarios.
We develop a novel technique, GPT4Hints-GPT3.5Val, to push the limits of generative AI models.
arXiv Detail & Related papers (2023-10-05T17:02:59Z) - Evaluating ChatGPT and GPT-4 for Visual Programming [20.64766977405438]
We evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, in visual programming domains for various scenarios.
Our results show that these models perform poorly and struggle to combine spatial, logical, and programming skills crucial for visual programming.
arXiv Detail & Related papers (2023-07-30T22:13:20Z) - Thrilled by Your Progress! Large Language Models (GPT-4) No Longer
Struggle to Pass Assessments in Higher Education Programming Courses [0.0]
GPT models evolved from completely failing the typical programming class' assessments to confidently passing the courses with no human involvement.
This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use technology that can be utilized by learners to collect passing scores.
arXiv Detail & Related papers (2023-06-15T22:12:34Z) - Exploring the Trade-Offs: Unified Large Language Models vs Local
Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z) - AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation model in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z) - Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data.
We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more.
We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z) - A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models [71.42197262495056]
GPT series models have gained considerable attention due to their exceptional natural language processing capabilities.
We select six representative models, comprising two GPT-3 series models and four GPT-3.5 series models.
We evaluate their performance on nine natural language understanding (NLU) tasks using 21 datasets.
Our experiments reveal that the overall ability of GPT series models on NLU tasks does not increase gradually as the models evolve.
arXiv Detail & Related papers (2023-03-18T14:02:04Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - Guiding Generative Language Models for Data Augmentation in Few-Shot
Text Classification [59.698811329287174]
We leverage GPT-2 for generating artificial training instances in order to improve classification performance.
Our results show that fine-tuning GPT-2 in a handful of label instances leads to consistent classification improvements.
arXiv Detail & Related papers (2021-11-17T12:10:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.