Do Machines and Humans Focus on Similar Code? Exploring Explainability
of Large Language Models in Code Summarization
- URL: http://arxiv.org/abs/2402.14182v1
- Date: Thu, 22 Feb 2024 00:01:02 GMT
- Title: Do Machines and Humans Focus on Similar Code? Exploring Explainability
of Large Language Models in Code Summarization
- Authors: Jiliang Li, Yifan Zhang, Zachary Karas, Collin McMillan, Kevin Leach,
Yu Huang
- Abstract summary: We report negative results from our investigation of explainability of language models in code summarization through the lens of human comprehension.
We employ a state-of-the-art model-agnostic, black-box, perturbation-based approach, SHAP, to identify which code tokens influence the generation of summaries.
Our study highlights an inability to align human focus with SHAP-based model focus measures.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent language models have demonstrated proficiency in summarizing source
code. However, as in many other domains of machine learning, language models of
code lack sufficient explainability. Informally, we lack a formulaic or
intuitive understanding of what and how models learn from code. Explainability
of language models can be partially provided if, as the models learn to produce
higher-quality code summaries, they also align in deeming the same code parts
important as those identified by human programmers. In this paper, we report
negative results from our investigation of explainability of language models in
code summarization through the lens of human comprehension. We measure human
focus on code using eye-tracking metrics such as fixation counts and duration
in code summarization tasks. To approximate language model focus, we employ a
state-of-the-art model-agnostic, black-box, perturbation-based approach, SHAP
(SHapley Additive exPlanations), to identify which code tokens influence the
generation of summaries. Using these settings, we find no statistically
significant relationship between language models' focus and human programmers'
attention. Furthermore, alignment between model and human foci in this setting
does not seem to dictate the quality of the LLM-generated summaries. Our study
highlights an inability to align human focus with SHAP-based model focus
measures. This result calls for future investigation of multiple open questions
for explainable language models for code summarization and software engineering
tasks in general, including the training mechanisms of language models for
code, whether there is an alignment between human and model attention on code,
whether human attention can improve the development of language models, and
what other model focus measures are appropriate for improving explainability.
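
To make the comparison described in the abstract concrete, the sketch below shows one way to contrast SHAP-based token importance from a code-summarization model with per-token human fixation counts. It is a minimal illustration under stated assumptions, not the authors' pipeline: the model checkpoint (Salesforce/codet5-base-multi-sum), the example snippet, and the placeholder fixation counts are hypothetical choices introduced here.

```python
# Minimal sketch (not the authors' code): correlate perturbation-based SHAP
# importance of input code tokens with per-token human fixation counts.
import numpy as np
import shap
from scipy.stats import spearmanr
from transformers import pipeline

# Any seq2seq code-summarization model could stand in here; this checkpoint is
# an illustrative assumption, not necessarily the one used in the paper.
summarizer = pipeline("summarization", model="Salesforce/codet5-base-multi-sum")

code_snippet = "def add(a, b):\n    return a + b\n"

# SHAP's text explainer perturbs (masks) subsets of input tokens and measures
# the effect on the generated summary, yielding per-token attributions.
explainer = shap.Explainer(summarizer)
shap_values = explainer([code_snippet])

# Collapse the (input tokens x output tokens) attribution matrix into one
# importance score per input token by summing absolute contributions.
model_focus = np.abs(shap_values.values[0]).sum(axis=-1)
tokens = shap_values.data[0]

# Placeholder human focus: in the study this would be eye-tracking fixation
# counts (or durations) aligned to the same tokens; random values are used
# here purely so the snippet runs end to end.
rng = np.random.default_rng(0)
human_focus = rng.integers(0, 10, size=len(tokens))

# Rank correlation between model focus and human focus; the paper reports no
# statistically significant relationship in this kind of comparison.
rho, p_value = spearmanr(model_focus, human_focus)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```

In practice the alignment step is the delicate part: eye-tracking fixations are recorded over screen regions, so they must be mapped onto the same tokenization the explainer uses before any correlation is meaningful.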
Related papers
- Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation (arXiv: 2024-10-01)
  We propose a retrieval-augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
  We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
  We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state of the art for this task and our target languages.
- Curriculum Learning for Small Code Language Models (arXiv: 2024-07-14)
  This paper explores the potential of curriculum learning in enhancing the performance of code language models.
  We demonstrate that a well-designed curriculum learning approach significantly improves the accuracy of small decoder-only code language models.
- Code Representation Learning At Scale (arXiv: 2024-02-02)
  We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
  We first train the encoders via a mix that leverages both randomness in masked language modeling and the structural aspects of programming languages.
  We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner.
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models (arXiv: 2023-09-29)
  We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
  We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
  In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
- Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning (arXiv: 2023-08-23)
  We show that scaling diffusion language models can effectively make them strong language learners.
  We build competent diffusion language models at scale by first acquiring knowledge from massive data.
  Experiments show that scaling diffusion language models consistently improves performance across downstream language tasks.
- Towards Understanding What Code Language Models Learned (arXiv: 2023-06-20)
  Pre-trained language models are effective in a variety of natural language tasks.
  It has been argued that their capabilities fall short of fully learning meaning or understanding language.
  We investigate their ability to capture the semantics of code beyond superficial frequency and co-occurrence.
- A Survey of Large Language Models (arXiv: 2023-03-31)
  Language modeling has been widely studied for language understanding and generation in the past two decades.
  Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
  To mark the difference in parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
- Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in Natural Language Understanding (arXiv: 2022-04-13)
  Curriculum is a new format of NLI benchmark for the evaluation of broad-coverage linguistic phenomena.
  We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.
- Analyzing the Limits of Self-Supervision in Handling Bias in Language (arXiv: 2021-12-16)
  We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction, and rephrasing.
  Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
- Limits of Detecting Text Generated by Large-Scale Language Models (arXiv: 2020-02-09)
  Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
  Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
This list is automatically generated from the titles and abstracts of the papers on this site.