ReDefining Code Comprehension: Function Naming as a Mechanism for Evaluating Code Comprehension
- URL: http://arxiv.org/abs/2503.12207v1
- Date: Sat, 15 Mar 2025 17:22:14 GMT
- Title: ReDefining Code Comprehension: Function Naming as a Mechanism for Evaluating Code Comprehension
- Authors: David H. Smith IV, Max Fowler, Paul Denny, Craig Zilles
- Abstract summary: "Explain in Plain English" (EiPE) questions are widely used to assess code comprehension skills. Recent approaches like Code Generation Based Grading (CGBG) leverage large language models to generate code from student explanations. We propose a modified approach where students generate function names, emphasizing the function's purpose over implementation details.
- Score: 2.250363093539224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: "Explain in Plain English" (EiPE) questions are widely used to assess code comprehension skills but are challenging to grade automatically. Recent approaches like Code Generation Based Grading (CGBG) leverage large language models (LLMs) to generate code from student explanations and validate its equivalence to the original code using unit tests. However, this approach does not differentiate between high-level, purpose-focused responses and low-level, implementation-focused ones, limiting its effectiveness in assessing comprehension level. We propose a modified approach where students generate function names, emphasizing the function's purpose over implementation details. We evaluate this method in an introductory programming course and analyze it using Item Response Theory (IRT) to understand its effectiveness as exam items and its alignment with traditional EiPE grading standards. We also publish this work as an open source Python package for autograding EiPE questions, providing a scalable solution for adoption.
Related papers
- On Explaining (Large) Language Models For Code Using Global Code-Based Explanations [45.126233498200534]
Language Models for Code (LLM4Code) have significantly changed the landscape of software engineering (SE).
We introduce code rationales (Code$Q$), a technique with rigorous mathematical underpinning, to identify subsets of tokens that can explain individual code predictions.
Our evaluation demonstrates that Code$Q$ is a powerful interpretability method for explaining how (less) meaningful input concepts (i.e., the natural language particle 'at') impact output generation.
arXiv Detail & Related papers (2025-03-21T01:00:45Z) - Counting the Trees in the Forest: Evaluating Prompt Segmentation for Classifying Code Comprehension Level [2.250363093539224]
This paper introduces a novel method for automatically assessing the comprehension level of responses to "Explain in Plain English" questions. Using a Large Language Model (LLM) to segment both the student's description and the code, we aim to determine whether the student describes each line individually (many segments) or the code as a whole.
arXiv Detail & Related papers (2025-03-15T17:57:38Z) - Commenting Higher-level Code Unit: Full Code, Reduced Code, or Hierarchical Code Summarization [35.159417478678286]
There is a significant lack of research on summarizing higher-level code units, such as file-level and module-level code units. We explore various summarization strategies for automatic code summarization (ACS) of higher-level code units, which can be divided into three types: full code summarization, reduced code summarization, and hierarchical code summarization.
arXiv Detail & Related papers (2025-03-13T16:15:06Z) - Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning (ICL). We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads. We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z) - Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4's effectiveness in recommending and suggesting idiomatic actions. Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z) - Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that the pointwise mutual information between a context and a question is an effective gauge for language model performance. We propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z) - Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification (UQ) is a key element of machine learning applications.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.
We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - Explaining Code with a Purpose: An Integrated Approach for Developing Code Comprehension and Prompting Skills [4.776920192249936]
We propose using an LLM to generate code based on students' responses to EiPE questions.
We report student success in creating effective prompts for solving EiPE questions.
arXiv Detail & Related papers (2024-03-10T00:23:08Z) - Introducing User Feedback-based Counterfactual Explanations (UFCE) [49.1574468325115]
Counterfactual explanations (CEs) have emerged as a viable solution for generating comprehensible explanations in XAI.
UFCE allows for the inclusion of user constraints to determine the smallest modifications in the subset of actionable features.
UFCE outperforms two well-known CE methods in terms of proximity, sparsity, and feasibility.
arXiv Detail & Related papers (2024-02-26T20:09:44Z) - Code Generation Based Grading: Evaluating an Auto-grading Mechanism for "Explain-in-Plain-English" Questions [0.0]
"Code Generation Based Grading" (CGBG) achieves moderate agreement with human graders.
CGBG achieves moderate agreement with human graders with respect to low-level and line-by-line descriptions of code.
arXiv Detail & Related papers (2023-11-25T02:45:00Z) - FIND: A Function Description Benchmark for Evaluating Interpretability Methods [86.80718559904854]
This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating automated interpretability methods.
FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate.
We evaluate methods that use pretrained language models to produce descriptions of function behavior in natural language and code.
arXiv Detail & Related papers (2023-09-07T17:47:26Z) - Fast Few-Shot Classification by Few-Iteration Meta-Learning [173.32497326674775]
We introduce a fast optimization-based meta-learning method for few-shot classification.
Our strategy enables important aspects of the base learner objective to be learned during meta-training.
We perform a comprehensive experimental analysis, demonstrating the speed and effectiveness of our approach.
arXiv Detail & Related papers (2020-10-01T15:59:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.