Evaluating ChatGPT and GPT-4 for Visual Programming
- URL: http://arxiv.org/abs/2308.02522v1
- Date: Sun, 30 Jul 2023 22:13:20 GMT
- Title: Evaluating ChatGPT and GPT-4 for Visual Programming
- Authors: Adish Singla
- Abstract summary: We evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, in visual programming domains for various scenarios.
Our results show that these models perform poorly and struggle to combine spatial, logical, and programming skills crucial for visual programming.
- Score: 20.64766977405438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative AI and large language models have the potential to drastically improve the landscape of computing education by automatically generating personalized feedback and content. Recent works have studied the capabilities of these models for different programming education scenarios; however, these works considered only text-based programming, in particular, Python programming. Consequently, they leave open the question of how well these models would perform in visual programming domains popularly used for K-8 programming education. The main research question we study is: Do state-of-the-art generative models show advanced capabilities in visual programming on par with their capabilities in text-based Python programming? In our work, we evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, in visual programming domains for various scenarios and assess performance using expert-based annotations. In particular, we base our evaluation on reference tasks from the domains of Hour of Code: Maze Challenge by Code.org and Karel. Our results show that these models perform poorly and struggle to combine the spatial, logical, and programming skills crucial for visual programming. These results also provide exciting directions for future work on developing techniques to improve the performance of generative models in visual programming.
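The Hour of Code: Maze Challenge and Karel domains pose grid-world tasks in which a short block-style program must steer an agent to a goal. As a concrete illustration, below is a minimal Python sketch of such a task and a candidate solution; the grid encoding, class, and method names are hypothetical stand-ins, not the paper's actual task representation.

```python
# Toy Karel/HoC-Maze-style task: steer the agent 'A' to the goal 'G'.
# Solving it needs exactly the mix the paper probes: spatial reasoning
# (grid, orientation), logic (conditionals), and programming (loops).
GRID = [
    "....G",  # G = goal
    ".###.",  # '#' = wall
    "A....",  # A = agent start, facing east
]
DIRS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # east, south, west, north

class Agent:
    def __init__(self, row, col, heading=0):
        self.row, self.col, self.heading = row, col, heading

    def front_is_clear(self):
        dr, dc = DIRS[self.heading]
        r, c = self.row + dr, self.col + dc
        return 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] != "#"

    def move(self):
        dr, dc = DIRS[self.heading]
        self.row, self.col = self.row + dr, self.col + dc

    def turn_left(self):
        self.heading = (self.heading - 1) % 4

# A candidate solution of the kind a model would be asked to synthesize:
# a bounded repeat with a conditional, mirroring HoC Maze block code.
agent = Agent(2, 0)
for _ in range(8):
    if agent.front_is_clear():
        agent.move()
    else:
        agent.turn_left()

print("reached goal:", GRID[agent.row][agent.col] == "G")  # True
```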
Related papers
- Evaluating Contextually Personalized Programming Exercises Created with Generative AI [4.046163999707179]
This article reports on a user study conducted in an elective programming course that included contextually personalized programming exercises created with GPT-4.
The results demonstrate that the quality of exercises generated with GPT-4 was generally high.
This suggests that AI-generated programming problems can be a worthwhile addition to introductory programming courses.
arXiv Detail & Related papers (2024-06-11T12:59:52Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
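As a rough illustration of what such a graphical view captures, the sketch below extracts simple control-flow and data-flow edges from a Python function using the standard `ast` module; CodeGRAG's actual graph construction and GNN encoding are more elaborate.

```python
# Build a toy flow graph for a code block: control-flow edges follow
# statement order, data-flow edges link definitions of a name to later
# uses (a deliberate over-approximation ignoring scoping and kill sets).
import ast

SRC = """
def total(xs):
    s = 0
    for x in xs:
        s = s + x
    return s
"""

func = ast.parse(SRC).body[0]

# Control-flow edges: consecutive top-level statements, by line number.
stmts = func.body
control_edges = [(a.lineno, b.lineno) for a, b in zip(stmts, stmts[1:])]

# Data-flow edges: every definition of a name to each later use of it.
defs, uses = {}, {}
for node in ast.walk(func):
    if isinstance(node, ast.Name):
        table = defs if isinstance(node.ctx, ast.Store) else uses
        table.setdefault(node.id, set()).add(node.lineno)

dataflow_edges = sorted(
    (d, u, name)
    for name, dlines in defs.items()
    for d in dlines
    for u in uses.get(name, ())
    if u >= d
)

print("control flow:", control_edges)  # [(3, 4), (4, 6)]
print("data flow:", dataflow_edges)    # e.g. (4, 5, 'x'): def of x feeds line 5
```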
- Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation [25.317788211120362]
We investigate the role of generative AI models in providing human tutor-style programming hints.
Recent works have benchmarked state-of-the-art models for various feedback generation scenarios.
We develop a novel technique, GPT4Hints-GPT3.5Val, to push the limits of generative AI models.
arXiv Detail & Related papers (2023-10-05T17:02:59Z)
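The technique's name reflects a generate-then-validate split: a stronger tutor model drafts a hint, and a weaker student model serves as a validation gate before the hint is shown. A minimal sketch of that loop, with hypothetical stand-ins for the model calls and illustrative thresholds (the paper's prompts and quality checks are more involved):

```python
# Generate-then-validate hint pipeline (sketch). The query_* functions are
# hypothetical placeholders for GPT-4 / GPT-3.5 API calls, not a real API.

def query_tutor(buggy_code: str, task: str) -> str:
    """Tutor model (GPT-4 in the paper): draft a natural-language hint."""
    raise NotImplementedError

def query_student(buggy_code: str, task: str, hint: str) -> str:
    """Student model (GPT-3.5 in the paper): repair the code given the hint."""
    raise NotImplementedError

def passes_tests(code: str, tests) -> bool:
    """Run the task's test suite against a candidate program."""
    raise NotImplementedError

def validated_hint(buggy_code, task, tests, n_drafts=3, n_checks=3):
    for _ in range(n_drafts):
        hint = query_tutor(buggy_code, task)
        # Accept the hint only if the student model, given the hint,
        # repairs the program often enough to pass the tests.
        fixes = sum(
            passes_tests(query_student(buggy_code, task, hint), tests)
            for _ in range(n_checks)
        )
        if fixes >= 2:  # majority of validation rollouts succeed (illustrative)
            return hint
    return None  # abstain rather than show an unvalidated hint
```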
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
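Confidence calibration is commonly summarized with expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence with its empirical accuracy. A sketch of the standard binned ECE, which may differ from the paper's exact protocol:

```python
# Expected calibration error over model confidences vs. pass/fail outcomes.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight the gap by bin population
    return ece

# Example: per-program model confidence and whether the program passed tests.
conf = [0.9, 0.8, 0.95, 0.4, 0.6]
ok = [1, 1, 0, 0, 1]
print(f"ECE = {expected_calibration_error(conf, ok):.3f}")  # 0.410
```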
- Exploring Large Language Model for Graph Data Understanding in Online Job Recommendations [63.19448893196642]
We present a novel framework that harnesses the rich contextual information and semantic representations provided by large language models to analyze behavior graphs.
By leveraging this capability, our framework enables personalized and accurate job recommendations for individual users.
arXiv Detail & Related papers (2023-07-10T11:29:41Z)
- Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors [21.227955181065948]
We systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios.
Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios.
arXiv Detail & Related papers (2023-06-29T17:57:40Z)
- Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses [0.0]
GPT models evolved from completely failing the typical programming class' assessments to confidently passing the courses with no human involvement.
This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use technology that can be utilized by learners to collect passing scores.
arXiv Detail & Related papers (2023-06-15T22:12:34Z)
- Visual Programming for Text-to-Image Generation and Evaluation [73.12069620086311]
We propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation.
First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation.
Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming.
arXiv Detail & Related papers (2023-05-24T16:42:17Z)
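To make VPGen's decomposition concrete, the sketch below spells out the three-step pipeline as plain functions; the function bodies are hypothetical stand-ins (in the paper, a fine-tuned language model handles the first two steps and a layout-to-image model the third).

```python
# VPGen-style staged text-to-image pipeline (interface sketch only).
from dataclasses import dataclass

@dataclass
class Box:
    name: str
    x: float  # normalized [0, 1] coordinates
    y: float
    w: float
    h: float

def generate_objects(prompt: str) -> list[str]:
    """Step 1: list the objects the image needs, with counts expanded,
    e.g. "two dogs under a tree" -> ["dog", "dog", "tree"]."""
    raise NotImplementedError

def generate_layout(prompt: str, objects: list[str]) -> list[Box]:
    """Step 2: place each object as a normalized bounding box."""
    raise NotImplementedError

def generate_image(prompt: str, layout: list[Box]):
    """Step 3: render an image conditioned on the prompt and the layout."""
    raise NotImplementedError

def vpgen_style_pipeline(prompt: str):
    objects = generate_objects(prompt)         # object/count generation
    layout = generate_layout(prompt, objects)  # layout generation
    return generate_image(prompt, layout)      # image generation
```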
- Automatic Generation of Programming Exercises and Code Explanations with Large Language Models [4.947560475228859]
OpenAI Codex is a recent large language model from the GPT-3 family for translating between natural language and code.
We explore the natural language generation capabilities of Codex in two different phases of the life of a programming exercise.
We find the majority of this automatically generated content both novel and sensible, and in many cases ready to use as is.
arXiv Detail & Related papers (2022-06-03T11:00:43Z)
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
- AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.