StudentEval: A Benchmark of Student-Written Prompts for Large Language
Models of Code
- URL: http://arxiv.org/abs/2306.04556v1
- Date: Wed, 7 Jun 2023 16:03:55 GMT
- Title: StudentEval: A Benchmark of Student-Written Prompts for Large Language
Models of Code
- Authors: Hannah McLean Babe, Sydney Nguyen, Yangtian Zi, Arjun Guha, Molly Q
Feldman, Carolyn Jane Anderson
- Abstract summary: StudentEval contains 1,749 prompts for 48 problems, written by 80 students who have only completed one semester of Python programming.
We analyze the prompts and find significant variation in students' prompting techniques.
- Score: 2.087827281461409
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code LLMs are being rapidly deployed and there is evidence that they can make
professional programmers more productive. Current benchmarks for code
generation measure whether models generate correct programs given an expert
prompt. In this paper, we present a new benchmark containing multiple prompts
per problem, written by a specific population of non-expert prompters:
beginning programmers. StudentEval contains 1,749 prompts for 48 problems,
written by 80 students who have only completed one semester of Python
programming. Our students wrote these prompts while working interactively with
a Code LLM, and we observed very mixed success rates. We use StudentEval to
evaluate 5 Code LLMs and find that StudentEval is a better discriminator of
model performance than existing benchmarks. We analyze the prompts and find
significant variation in students' prompting techniques. We also find that
nondeterministic LLM sampling could mislead students into thinking that their
prompts are more (or less) effective than they actually are, which has
implications for how to teach with Code LLMs.
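Like other code-generation benchmarks, StudentEval scores a prompt by whether the function generated from it passes the problem's test cases. A minimal sketch of that kind of scoring loop is shown below; the model call and the test format (a list of assert statements) are placeholders for illustration, not the benchmark's released harness.

```python
# Minimal sketch of scoring one student-written prompt by functional
# correctness, pass@1 style. The model is passed in as `generate`; the test
# format (a list of assert statements) is an assumption for illustration.
from typing import Callable

def run_tests(candidate_source: str, tests: list[str]) -> bool:
    """Execute the candidate and its tests in a fresh namespace; True if all pass."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)   # defines the candidate function
        for test in tests:
            exec(test, namespace)           # e.g. "assert add_one(3) == 4"
        return True
    except Exception:
        return False

def pass_at_1(generate: Callable[[str], str], prompt: str,
              tests: list[str], n_samples: int = 20) -> float:
    """Fraction of independently sampled completions that pass every test."""
    return sum(run_tests(generate(prompt), tests) for _ in range(n_samples)) / n_samples

# Toy usage with a hard-coded "model" so the sketch runs end to end.
if __name__ == "__main__":
    fake_model = lambda prompt: "def add_one(x):\n    return x + 1\n"
    tests = ["assert add_one(3) == 4", "assert add_one(-1) == 0"]
    print(pass_at_1(fake_model, "Write a function add_one(x) that returns x plus 1.", tests))
```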
Related papers
- Substance Beats Style: Why Beginning Students Fail to Code with LLMs [3.4817709155395327]
Existing work shows that beginners struggle to prompt LLMs to solve text-to-code tasks.
This paper explores two competing hypotheses about the cause of student-LLM miscommunication.
arXiv Detail & Related papers (2024-10-15T20:36:30Z)
- SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading [100.02175403852253]
One common use of Large Language Models (LLMs) is performing tasks on scientific topics.
Inspired by the way university students are evaluated on such tasks, we propose SciEx - a benchmark consisting of university computer science exam questions.
We evaluate the performance of various state-of-the-art LLMs on our new benchmark.
arXiv Detail & Related papers (2024-06-14T21:52:21Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
A key and frequently observed weakness is that LLMs can behave incorrectly when the math questions are slightly changed.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp.
Specifically, the cunning texts that FLUB focuses on mainly consist of tricky, humorous, and misleading texts collected from the real internet.
Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z)
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting yields a substantial performance boost for multiple LLMs.
Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
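As a rough illustration of the idea (not the paper's exact transformation), a conditional-reasoning problem can be restated as Python-shaped pseudocode and that code used as the prompt:

```python
# Hedged sketch of "code prompting": restate a natural-language
# conditional-reasoning problem as code and send that code to the model.
# The template below is illustrative only, not the paper's exact format.

def make_code_prompt(rule: str, facts: dict, question: str) -> str:
    """Render a rule, known facts, and a question as a code-shaped prompt."""
    fact_lines = "\n".join(f"{k} = {v!r}" for k, v in facts.items())
    return (
        f"# Rule: {rule}\n"
        f"{fact_lines}\n"
        f"# Question: {question}\n"
        "# Complete the function so it answers the question under the rule.\n"
        "def answer():\n"
        "    ...\n"
    )

print(make_code_prompt(
    rule="Applicants qualify if they are at least 18 and present a valid ID.",
    facts={"age": 17, "has_valid_id": True},
    question="Does the applicant qualify?",
))
```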
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
- A Prompt Learning Framework for Source Code Summarization [24.33455799484519]
We propose a novel prompt learning framework for code summarization called PromptCS.
PromptCS trains a prompt agent that can generate continuous prompts to unleash the potential of LLMs in code summarization.
We evaluate PromptCS on the CodeSearchNet dataset involving multiple programming languages.
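The "continuous prompt" idea can be sketched, under the usual soft-prompting assumptions, as a small block of trainable vectors prepended to the frozen model's token embeddings; the PyTorch snippet below is a generic soft-prompt sketch, not the PromptCS implementation.

```python
# Generic soft-prompt sketch (not the PromptCS implementation): a learnable
# block of prompt vectors is prepended to the frozen model's token embeddings.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, prompt_len: int, embed_dim: int):
        super().__init__()
        # Trainable prompt vectors; the backbone LLM's weights stay frozen.
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) from the frozen embedding layer
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)  # (batch, prompt_len + seq_len, dim)

# Usage: embed the code tokens with the frozen model, prepend the soft prompt,
# and train only SoftPrompt's parameters on the summarization objective.
soft = SoftPrompt(prompt_len=16, embed_dim=768)
dummy = torch.zeros(2, 50, 768)
print(soft(dummy).shape)  # torch.Size([2, 66, 768])
```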
arXiv Detail & Related papers (2023-12-26T14:37:55Z)
- ProCoT: Stimulating Critical Thinking and Writing of Students through Engagement with Large Language Models (LLMs) [0.7545833157486899]
We introduce a novel writing method called Probing Chain-of-Thought (ProCoT).
It potentially prevents students from cheating with a Large Language Model (LLM).
We conduct studies with ProCoT in two different courses with 65 students.
arXiv Detail & Related papers (2023-12-15T14:01:46Z)
- AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for invoking Large Language Models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
- Testing LLMs on Code Generation with Varying Levels of Prompt Specificity [0.0]
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing.
The potential to transform natural language prompts into executable code promises a major shift in software development practices.
arXiv Detail & Related papers (2023-11-10T23:41:41Z)
- Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests [1.8260333137469122]
We assess how good large language models (LLMs) are at identifying issues in problematic code that students request help on.
We collected a sample of help requests and code from an online programming course.
arXiv Detail & Related papers (2023-06-09T07:19:43Z)
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL), which use an LLM to read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
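The program-aided pattern is straightforward to sketch: the LLM writes Python for the reasoning steps, and the interpreter, not the model, computes the final answer. In the toy example below the "generated" code is hard-coded so the sketch runs without a model call.

```python
# Sketch of the program-aided pattern: the model writes Python for the
# reasoning steps and the interpreter computes the answer. The "generated"
# code here is hard-coded so the example runs without an actual model call.

def solve_with_interpreter(generated_code: str, answer_var: str = "answer"):
    """Execute model-generated code and read the designated answer variable."""
    namespace: dict = {}
    exec(generated_code, namespace)          # offload the arithmetic to Python
    return namespace[answer_var]

generated = """
# Roger has 5 tennis balls. He buys 2 cans of 3 balls each.
initial_balls = 5
bought_balls = 2 * 3
answer = initial_balls + bought_balls
"""
print(solve_with_interpreter(generated))     # 11
```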
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.