Kattis vs. ChatGPT: Assessment and Evaluation of Programming Tasks in
the Age of Artificial Intelligence
- URL: http://arxiv.org/abs/2312.01109v1
- Date: Sat, 2 Dec 2023 11:09:17 GMT
- Authors: Nora Dunder, Saga Lundborg, Olga Viberg, Jacqueline Wong
- Abstract summary: The effectiveness of large language models at solving programming tasks is underexplored.
The present study examines ChatGPT's ability to generate code solutions at different difficulty levels for introductory programming courses.
Results contribute to the ongoing debate on the utility of AI-powered tools in programming education.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI-powered education technologies can support students and teachers in
computer science education. However, with the recent developments in generative
AI, and in particular the rapidly growing popularity of ChatGPT, the
effectiveness of large language models at solving programming tasks remains
underexplored. The present study examines ChatGPT's ability to generate
code solutions at different difficulty levels for introductory programming
courses. We conducted an experiment where ChatGPT was tested on 127 randomly
selected programming problems provided by Kattis, an automatic software grading
tool for computer science programs, often used in higher education. The results
showed that ChatGPT could independently solve 19 of the 127 programming tasks
provided and assessed by Kattis. Further, ChatGPT generated accurate code
solutions for simple problems but struggled with more complex programming
tasks. The results contribute to the
ongoing debate on the utility of AI-powered tools in programming education.
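The experiment implies an autograder-style pipeline: a model-generated solution is run against a problem's input/output cases, and a task counts as solved only if every case matches. The sketch below illustrates that all-or-nothing criterion; it is a minimal illustration under assumed conventions, not the authors' actual setup, and names such as the problems/ directory, solve.py, and the .in/.ans file pairs are hypothetical (Kattis itself judges submissions server-side).

```python
# Minimal sketch of an autograder-style check (hypothetical layout:
# problems/<name>/solve.py is the model-generated Python solution and
# problems/<name>/samples/ holds paired *.in / *.ans files).
import subprocess
from pathlib import Path

def passes_all_cases(solution: Path, cases_dir: Path, timeout: float = 2.0) -> bool:
    """Solved only if every test case's output matches exactly."""
    for case_in in sorted(cases_dir.glob("*.in")):
        expected = case_in.with_suffix(".ans").read_text()
        try:
            run = subprocess.run(
                ["python3", str(solution)],
                input=case_in.read_text(),
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # exceeding the time limit fails the task
        if run.returncode != 0 or run.stdout.rstrip() != expected.rstrip():
            return False
    return True

problems = sorted(Path("problems").iterdir())
solved = sum(passes_all_cases(p / "solve.py", p / "samples") for p in problems)
print(f"solved {solved}/{len(problems)}")  # the paper reports 19/127, roughly 15%
```

Under this all-or-nothing criterion, the reported 19 out of 127 tasks corresponds to a solve rate of roughly 15%.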
Related papers
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions.
GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer under at least one prompting strategy for 85.1% of questions.
Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z)
- Beyond the Hype: A Cautionary Tale of ChatGPT in the Programming Classroom
The paper offers insights for academics who teach programming on how to create more challenging exercises and how to engage responsibly with ChatGPT to promote classroom integrity.
We analyzed the various practical programming examples from past IS exercises and compared those with memos created by tutors and lecturers in a university setting.
arXiv Detail & Related papers (2024-06-16T23:52:37Z)
- Exploring ChatGPT's Capabilities on Vulnerability Management
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z)
- Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues
We systematically study the quality of 4,066 ChatGPT-generated code snippets implemented in two popular programming languages.
We identify and characterize potential issues with the quality of ChatGPT-generated code.
We find that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement.
arXiv Detail & Related papers (2023-07-24T08:14:22Z)
- Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures
We evaluate ChatGPT's ability to generate correct solutions to the problems fed to it, the quality of its code, and the nature of the run-time errors the code throws.
We look for patterns in which test cases pass to gain insight into how far from correct ChatGPT's code is in such situations.
arXiv Detail & Related papers (2023-07-10T08:20:34Z)
- ChatGPT, Can You Generate Solutions for my Coding Exercises? An Evaluation on its Effectiveness in an Undergraduate Java Programming Course
ChatGPT is a large-scale, deep learning-driven natural language processing model.
Our evaluation involves analyzing ChatGPT-generated solutions for 80 diverse programming exercises.
arXiv Detail & Related papers (2023-05-23T04:38:37Z)
- Exploring the Use of ChatGPT as a Tool for Learning and Assessment in Undergraduate Computer Science Curriculum: Opportunities and Challenges
This paper addresses the prospects and obstacles associated with utilizing ChatGPT as a tool for learning and assessment in undergraduate Computer Science curriculum.
Students in one group (Group B) were given access to ChatGPT and encouraged to use it to help solve the programming challenges.
Results show that students using ChatGPT had an advantage in earned scores; however, their submitted code contained inconsistencies and inaccuracies.
arXiv Detail & Related papers (2023-04-16T21:04:52Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
- A Categorical Archive of ChatGPT Failures
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z)
- Competition-Level Code Generation with AlphaCode
We introduce AlphaCode, a system for code generation that can create novel solutions to problems that require deeper reasoning.
In simulated evaluations on recent programming competitions on the Codeforces platform, AlphaCode achieved an average ranking in the top 54.3%.
arXiv Detail & Related papers (2022-02-08T23:16:31Z)
- Measuring Coding Challenge Competence With APPS
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems (see the sketch after this list contrasting this metric with full solve rates).
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
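The APPS entry above reports the share of test cases passed, a more lenient metric than the share of problems fully solved used in the Kattis study. The toy numbers below (invented purely for illustration) show how sharply the two can diverge: a model can pass 15% of test cases on average while solving no problem outright.

```python
# Toy illustration (made-up numbers): average test-case pass rate vs.
# the stricter all-tests-pass criterion used by judges such as Kattis.
results = [(3, 20), (1, 10), (4, 20), (3, 20)]  # (cases passed, cases total)

avg_case_rate = sum(p / t for p, t in results) / len(results)
solve_rate = sum(p == t for p, t in results) / len(results)

print(f"average test-case pass rate: {avg_case_rate:.0%}")  # 15%
print(f"problems fully solved:       {solve_rate:.0%}")     # 0%
```

This gap is worth keeping in mind when comparing APPS-style pass rates with all-or-nothing verdicts like those reported in the study above.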
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.