Investigating the Impact of Vocabulary Difficulty and Code Naturalness on Program Comprehension
- URL: http://arxiv.org/abs/2308.13429v1
- Date: Fri, 25 Aug 2023 15:15:00 GMT
- Title: Investigating the Impact of Vocabulary Difficulty and Code Naturalness on Program Comprehension
- Authors: Bin Lin, Gregorio Robles
- Abstract summary: This study aims to assess readability and understandability from the perspective of language acquisition.
We will conduct a statistical analysis to understand their correlations and analyze whether code naturalness and vocabulary difficulty can be used to improve the performance of readability and understandability prediction methods.
- Score: 3.35803394416914
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Context: Developers spend most of their time comprehending source code during
software development. Automatically assessing how readable and understandable
source code is can provide various benefits in different tasks, such as task
triaging and code reviews. While several studies have proposed approaches to
predict software readability and understandability, most of them only focus on
local characteristics of source code. Moreover, the performance of
understandability prediction remains far from satisfactory.
Objective: In this study, we aim to assess readability and understandability
from the perspective of language acquisition. More specifically, we would like
to investigate whether code readability and understandability are correlated
with the naturalness and vocabulary difficulty of source code.
Method: To assess code naturalness, we adopt the cross-entropy metric, while
to assess vocabulary difficulty we use a manually crafted list of code elements
with their assigned advancement levels. We will conduct a
statistical analysis to understand their correlations and analyze whether code
naturalness and vocabulary difficulty can be used to improve the performance of
code readability and understandability prediction methods. The study will be
conducted on existing datasets.
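To make the naturalness metric concrete, below is a minimal sketch of the standard cross-entropy computation, H(s) = -(1/n) * sum(log2 p(t_i | context)), under a toy bigram language model. The corpus, tokenization, and add-alpha smoothing are illustrative assumptions, not the study's actual language model or setup.
```python
# Cross-entropy as a code naturalness proxy: a toy bigram language model
# with add-alpha smoothing. Corpus and tokenization are illustrative.
import math
from collections import Counter

def train_bigram(corpus_tokens, alpha=1.0):
    """Return p(cur | prev) with add-alpha smoothing over the corpus."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(unigrams)
    def prob(prev, cur):
        return (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

def cross_entropy(snippet_tokens, prob):
    """H(s) = -(1/n) * sum(log2 p(t_i | t_{i-1})); lower = more natural."""
    pairs = list(zip(snippet_tokens, snippet_tokens[1:]))
    return -sum(math.log2(prob(p, c)) for p, c in pairs) / len(pairs)

# Toy usage: train on one snippet, score a structurally similar one.
corpus = "for i in range ( n ) : total += i".split()
snippet = "for j in range ( m ) : total += j".split()
prob = train_bigram(corpus)
print(f"cross-entropy: {cross_entropy(snippet, prob):.2f} bits/token")
```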
Related papers
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes [17.95094238686012]
Language models (LMs) have exhibited impressive abilities in generating code from natural language requirements.
We highlight the diversity of code generated by LMs as a critical criterion for evaluating their code generation capabilities.
We propose a systematic approach to evaluate the diversity of generated code, utilizing various metrics for inter-code similarity as well as functional correctness.
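As a rough illustration of measuring inter-code similarity for diversity, the sketch below averages pairwise SequenceMatcher ratios over a set of generated solutions. The metric choice is an assumption for illustration; the paper combines several similarity measures together with functional correctness.
```python
# Diversity as 1 minus mean pairwise similarity; SequenceMatcher is an
# illustrative stand-in for the paper's inter-code similarity metrics.
from difflib import SequenceMatcher
from itertools import combinations

def diversity(codes):
    """Average dissimilarity over all pairs of generated solutions."""
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(codes, 2)]
    return 1.0 - sum(sims) / len(sims)

samples = [
    "def add(a, b):\n    return a + b",
    "def add(x, y):\n    return x + y",
    "def add(a, b):\n    result = a + b\n    return result",
]
print(f"diversity: {diversity(samples):.3f}")
```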
arXiv Detail & Related papers (2024-08-24T07:40:22Z)
- When simplicity meets effectiveness: Detecting code comments coherence with word embeddings and LSTM [6.417777780911223]
Code comments play a crucial role in software development, as they provide programmers with practical information.
Developers tend to leave comments unchanged after updating the code, resulting in a discrepancy between the two artifacts.
It is crucial to identify whether, given a code snippet, its corresponding comment is coherent with it and faithfully reflects the intent behind the code.
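A hedged sketch of a purely lexical baseline for this task follows: cosine similarity between bag-of-words vectors of the comment and the code identifiers. The paper itself feeds word embeddings into an LSTM classifier; this proxy only illustrates the coherence-scoring idea.
```python
# Lexical coherence proxy: cosine similarity between bag-of-words vectors
# of a comment and its code. The paper's actual model is embeddings + LSTM.
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-zA-Z]+", text.lower())

def norm(v):
    return math.sqrt(sum(x * x for x in v.values()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na, nb = norm(a), norm(b)
    return dot / (na * nb) if na and nb else 0.0

comment = "# returns the maximum value in the list"
code = "def max_value(values): return max(values)"
score = cosine(Counter(tokens(comment)), Counter(tokens(code)))
print(f"coherence proxy: {score:.3f}")  # higher = more lexical overlap
```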
arXiv Detail & Related papers (2024-05-25T15:21:27Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent and can thereby help improve the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting yields a substantial performance boost across multiple LLMs.
Our analysis of GPT-3.5 reveals that the code formatting of the input problem is essential for the performance improvement.
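A small illustration of the idea, with an invented scenario: the natural-language conditions of a reasoning problem are rewritten as executable code, with the original text preserved in comments before being handed to the model.
```python
# Hypothetical example of code prompting: natural-language conditions are
# rewritten as code, with the original text preserved in comments.
# NL problem: "You can vote if you are a citizen and at least 18 years old.
#              Alice is a 17-year-old citizen. Can Alice vote?"
citizen = True                      # "Alice is a ... citizen"
age = 17                            # "17-year-old"
can_vote = citizen and age >= 18    # "citizen and at least 18 years old"
print("Can Alice vote?", can_vote)  # -> Can Alice vote? False
```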
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
- Source Code Comprehension: A Contemporary Definition and Conceptual Model for Empirical Investigation [5.139874302398955]
The research community has not managed to define source code comprehension as a concept.
An implicit definition by task prevails, i.e., code comprehension is what the experimental tasks measure.
This paper constitutes a reference work that defines source code comprehension and presents a conceptual framework.
arXiv Detail & Related papers (2023-10-17T14:23:46Z)
- Generating Summaries with Controllable Readability Levels [67.34087272813821]
Several factors affect the readability level, such as the complexity of the text, its subject matter, and the reader's background knowledge.
Current text generation approaches lack refined control, resulting in texts that are not customized to readers' proficiency levels.
We develop three text generation techniques for controlling readability: instruction-based readability control, reinforcement learning to minimize the gap between requested and observed readability, and a decoding approach that uses look-ahead to estimate the readability of upcoming decoding steps.
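The look-ahead idea can be sketched as scoring candidate continuations with a readability estimate and keeping the one closest to the requested level. The Flesch-style syllable heuristic and candidate set below are illustrative stand-ins for the paper's readability estimation inside decoding.
```python
# Look-ahead sketch: estimate readability of candidate continuations and
# keep the one closest to a requested level. The Flesch-style heuristic
# is an illustrative stand-in for the paper's readability estimator.
import re

def flesch_reading_ease(text):
    words = re.findall(r"[a-zA-Z]+", text)
    sentences = max(1, text.count("."))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    n = max(1, len(words))
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

def pick_continuation(candidates, target_score):
    """Choose the candidate whose estimated readability is nearest the target."""
    return min(candidates, key=lambda c: abs(flesch_reading_ease(c) - target_score))

candidates = [
    "The cat sat on the mat.",
    "The domesticated feline positioned itself upon the floor covering.",
]
print(pick_continuation(candidates, target_score=90))  # picks the simpler one
```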
arXiv Detail & Related papers (2023-10-16T17:46:26Z)
- Understanding Programs by Exploiting (Fuzzing) Test Cases [26.8259045248779]
We propose to incorporate the relationship between inputs and possible outputs/behaviors into learning, aiming at a deeper semantic understanding of programs.
To obtain inputs representative enough to trigger the execution of most of the code, we resort to fuzz testing and propose fuzz tuning.
The effectiveness of the proposed method is verified on two program understanding tasks, code clone detection and code classification, where it outperforms current state-of-the-art approaches by large margins.
arXiv Detail & Related papers (2023-05-23T01:51:46Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of semantically similar code.
We evaluate our approach on the code completion task for the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
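A minimal sketch of the retrieval step, assuming Jaccard token overlap and a hypothetical prompt layout (ReACC's actual retriever and format may differ): fetch the database snippet most similar to the unfinished context and prepend it to the completion prompt.
```python
# Retrieval step sketch: pick the database snippet most similar to the
# unfinished context and prepend it to the prompt. Jaccard overlap and
# the prompt layout are assumptions, not ReACC's actual retriever.
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def build_prompt(context, database):
    best = max(database, key=lambda snippet: jaccard(snippet, context))
    return f"# retrieved similar code:\n{best}\n# code to complete:\n{context}"

database = [
    "def read_json(path):\n    import json\n    return json.load(open(path))",
    "def read_lines(path):\n    return open(path).read().splitlines()",
]
context = "def read_config(path):\n    import json"
print(build_prompt(context, database))
```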
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- The Mind Is a Powerful Place: How Showing Code Comprehensibility Metrics Influences Code Understanding [10.644832702859484]
We investigate whether a displayed metric value for source code comprehensibility anchors developers in their subjective rating of source code comprehensibility.
We found that the displayed value of a comprehensibility metric has a significant and large anchoring effect on a developer's code comprehensibility rating.
arXiv Detail & Related papers (2020-12-16T14:27:45Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.