Exploring the Responses of Large Language Models to Beginner
Programmers' Help Requests
- URL: http://arxiv.org/abs/2306.05715v1
- Date: Fri, 9 Jun 2023 07:19:43 GMT
- Title: Exploring the Responses of Large Language Models to Beginner
Programmers' Help Requests
- Authors: Arto Hellas, Juho Leinonen, Sami Sarsa, Charles Koutcheme, Lilja Kujanpää, Juha Sorva
- Abstract summary: We assess how good large language models (LLMs) are at identifying issues in problematic code that students request help on.
We collected a sample of help requests and code from an online programming course.
- Score: 1.8260333137469122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background and Context: Over the past year, large language models (LLMs) have
taken the world by storm. In computing education, like in other walks of life,
many opportunities and threats have emerged as a consequence.
Objectives: In this article, we explore such opportunities and threats in a
specific area: responding to student programmers' help requests. More
specifically, we assess how good LLMs are at identifying issues in problematic
code that students request help on.
Method: We collected a sample of help requests and code from an online
programming course. We then prompted two different LLMs (OpenAI Codex and
GPT-3.5) to identify and explain the issues in the students' code and assessed
the LLM-generated answers both quantitatively and qualitatively.
Findings: GPT-3.5 outperforms Codex in most respects. Both LLMs frequently
find at least one actual issue in each student program (GPT-3.5 in 90% of the
cases). Neither LLM excels at finding all the issues (GPT-3.5 finding them 57%
of the time). False positives are common (40% chance for GPT-3.5). The advice
that the LLMs provide on the issues is often sensible. The LLMs perform better
on issues involving program logic than on output formatting. Model
solutions are frequently provided even when the LLM is prompted not to. LLM
responses to prompts in a non-English language are only slightly worse than
responses to English prompts.
Implications: Our results continue to highlight the utility of LLMs in
programming education. At the same time, the results highlight the
unreliability of LLMs: LLMs make some of the same mistakes that students do,
perhaps especially when formatting output as required by automated assessment
systems. Our study informs teachers interested in using LLMs as well as future
efforts to customize LLMs for the needs of programming education.
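
The Method paragraph above describes prompting LLMs to identify and explain issues in students' code without handing out solutions. As a minimal sketch of that kind of prompt, assuming the openai Python package (v1+) and a GPT-3.5 chat model, the example below sends a hypothetical exercise statement and student submission to the model; the prompt wording, exercise, and student code are illustrative assumptions, not the materials used in the study.

```python
# Minimal sketch: ask a GPT-3.5 chat model to identify issues in a
# student's code without revealing a model solution. The exercise,
# student code, and prompt wording below are hypothetical examples,
# not the actual materials or prompts used by the authors.
from openai import OpenAI  # assumes the `openai` package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXERCISE = "Write a program that reads two integers and prints their sum."
STUDENT_CODE = """\
a = input("Give first number: ")
b = input("Give second number: ")
print("Sum: " + a + b)
"""

def review_help_request(exercise: str, code: str) -> str:
    """Ask the model to list and explain the issues, but not to fix them."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a programming tutor. Identify and explain the "
                    "issues in the student's code. Do NOT provide a corrected "
                    "or model solution."
                ),
            },
            {
                "role": "user",
                "content": f"Exercise:\n{exercise}\n\nStudent code:\n{code}",
            },
        ],
        temperature=0.0,  # keep the feedback as deterministic as possible
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(review_help_request(EXERCISE, STUDENT_CODE))
```

As the Findings note, such a prompt is best-effort only: the model may still return a model solution even when the system message instructs it not to.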
Related papers
- Are LLMs Aware that Some Questions are not Open-ended? [58.93124686141781]
We study whether Large Language Models are aware that some questions have limited answers and need to respond more deterministically.
The lack of question awareness in LLMs leads to two phenomena: (1) responding too casually to non-open-ended questions, and (2) responding too blandly to open-ended questions.
arXiv Detail & Related papers (2024-10-01T06:07:00Z)
- SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading [100.02175403852253]
One common use of Large Language Models (LLMs) is performing tasks on scientific topics.
Inspired by the way university students are evaluated on such tasks, we propose SciEx - a benchmark consisting of university computer science exam questions.
We evaluate the performance of various state-of-the-art LLMs on our new benchmark.
arXiv Detail & Related papers (2024-06-14T21:52:21Z)
- CS1-LLM: Integrating LLMs into CS1 Instruction [0.6282171844772422]
This experience report describes a CS1 course at a large research-intensive university that fully embraces the use of Large Language Models.
To incorporate the LLMs, the course was intentionally altered to reduce emphasis on syntax and writing code from scratch.
Students were given three large, open-ended projects in three separate domains that allowed them to showcase their creativity.
arXiv Detail & Related papers (2024-04-17T14:44:28Z)
- Reasoning on Efficient Knowledge Paths: Knowledge Graph Guides Large Language Model for Domain Question Answering [18.94220625114711]
Large language models (LLMs) perform surprisingly well and outperform human experts on many tasks.
This paper integrates and optimizes a pipeline for selecting reasoning paths from a knowledge graph (KG) based on an LLM.
We also propose a simple and effective subgraph retrieval method based on chain of thought (CoT) and page rank.
arXiv Detail & Related papers (2024-04-16T08:28:16Z)
- Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model with far fewer parameters and take its answers as heuristic answers.
Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' general-purpose language understanding and generation abilities are acquired by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends [30.774685501251817]
General large language models (LLMs) have demonstrated significant potential in tasks such as code generation in software engineering.
A considerable portion of Code LLMs is derived from general LLMs through model fine-tuning.
There is currently a lack of systematic investigation into Code LLMs and their performance.
arXiv Detail & Related papers (2023-11-17T07:55:16Z)
- Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method [36.24876571343749]
Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks.
Recent literature reveals that LLMs generate nonfactual responses intermittently.
We propose a novel self-detection method to identify the questions that an LLM does not know and that are prone to producing nonfactual results.
arXiv Detail & Related papers (2023-10-27T06:22:14Z)
- Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
- CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models [43.655927559990616]
We propose CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs.
We evaluate 12 widely used LLMs, including both general-purpose and specialized models.
GPT-4 exhibits the best programming capabilities, achieving accuracies of approximately 69%, 54%, and 66% on the three tasks, respectively.
arXiv Detail & Related papers (2023-09-05T04:12:01Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes LLM-Augmenter, a system that augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.