Context-aware Code Summary Generation
- URL: http://arxiv.org/abs/2408.09006v1
- Date: Fri, 16 Aug 2024 20:15:34 GMT
- Title: Context-aware Code Summary Generation
- Authors: Chia-Yi Su, Aakash Bansal, Yu Huang, Toby Jia-Jun Li, Collin McMillan
- Abstract summary: Code summary generation is the task of writing natural language descriptions of a section of source code.
Recent advances in Large Language Models (LLMs) and other AI-based technologies have helped make automatic code summarization a reality.
We present an approach for including the broader project context, that is, why a function exists and its purpose in the program, in recent LLM-based code summarization.
- Score: 11.83787165247987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code summary generation is the task of writing natural language descriptions of a section of source code. Recent advances in Large Language Models (LLMs) and other AI-based technologies have helped make automatic code summarization a reality. However, the summaries these approaches write tend to focus on a narrow area of code. The results are summaries that explain what a function does internally, but lack a description of why the function exists or its purpose in the broader context of the program. In this paper, we present an approach for including this context in recent LLM-based code summarization. The input to our approach is a Java method and the project in which that method exists. The output is a succinct English description of why the method exists in the project. The core of our approach is a 350M-parameter language model we train, which can be run locally to ensure privacy. We train the model in two steps. First, we distill knowledge about code summarization from a large model; then we fine-tune the model using data from a study of human programmers who were asked to write code summaries. We find that our approach outperforms GPT-4 on this task.
Related papers
- Towards Summarizing Code Snippets Using Pre-Trained Transformers [20.982048349530483]
In this work, we take all the steps needed to train a DL model to document code snippets.
Our model identifies code summaries with 84% accuracy and is able to link them to the documented lines of code.
This unlocked the possibility of building a large-scale dataset of documented code snippets.
arXiv Detail & Related papers (2024-02-01T11:39:19Z)
- Learning to Prompt with Text Only Supervision for Vision-Language Models [107.282881515667]
One branch of methods adapts CLIP by learning prompts using visual information.
An alternative approach resorts to training-free methods by generating class descriptions from large language models.
We propose to combine the strengths of both streams by learning prompts using only text data.
arXiv Detail & Related papers (2024-01-04T18:59:49Z)
- A Comprehensive Review of State-of-The-Art Methods for Java Code Generation from Natural Language Text [0.0]
This paper provides a comprehensive review of the evolution and progress of deep learning models in the Java code generation task.
We focus on the most important methods and present their merits and limitations, as well as the objective functions used by the community.
arXiv Detail & Related papers (2023-06-10T07:27:51Z)
- Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization) [7.699967852459232]
Developers tend to have, consciously and unconsciously, a collection of semantic facts in mind when working on coding tasks.
One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them inherently capable of doing this simple level of "code analysis".
We evaluate whether automatically and explicitly augmenting an LLM's prompt with semantic facts actually helps.
arXiv Detail & Related papers (2023-04-13T20:49:35Z)
- Python Code Generation by Asking Clarification Questions [57.63906360576212]
In this work, we introduce a novel and more realistic setup for this task.
We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions.
We collect and introduce a new dataset named CodeClarQA, containing pairs of natural language descriptions and code, together with synthetically created clarification questions and answers.
arXiv Detail & Related papers (2022-12-19T22:08:36Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data [82.92758444543689]
Retrieval-based methods have been shown to be effective in NLP tasks by introducing external knowledge.
Surprisingly, we found that REtrieving from the traINing datA (REINA) alone can lead to significant gains on multiple NLG and NLU tasks.
Experimental results show that this simple method can achieve significantly better performance on a variety of NLU and NLG tasks.
arXiv Detail & Related papers (2022-03-16T17:37:27Z)
- Leveraging Unsupervised Learning to Summarize APIs Discussed in Stack Overflow [1.8047694351309207]
This paper proposes an automatic and novel approach for summarizing Android API methods discussed in Stack Overflow.
Our approach takes the API method's name as an input and generates a natural language summary based on Stack Overflow discussions of that API method.
We conducted a survey of 16 Android developers to evaluate the quality of our automatically generated summaries and compare them with the official Android documentation.
arXiv Detail & Related papers (2021-11-27T18:49:51Z)
- Exploiting Method Names to Improve Code Summarization: A Deliberation Multi-Task Learning Approach [5.577102440028882]
We design a novel multi-task learning (MTL) approach for code summarization.
We first introduce the tasks of generation and informativeness prediction of method names.
A novel two-pass deliberation mechanism is then incorporated into our MTL architecture to generate more consistent intermediate states.
arXiv Detail & Related papers (2021-03-21T17:52:21Z)
- Code to Comment "Translation": Data, Metrics, Baselining & Evaluation [49.35567240750619]
We analyze several recent code-comment datasets for this task.
We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural language translators.
We find some interesting differences between the code-comment data and the WMT19 natural language data.
arXiv Detail & Related papers (2020-10-03T18:57:26Z)
- Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems [74.8759568242933]
Task-oriented dialogue systems use four connected modules, namely Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy (DP), and Natural Language Generation (NLG).
A research challenge is to learn each module from as few samples as possible, given the high cost of data collection.
We evaluate the priming few-shot ability of language models in the NLU, DP and NLG tasks.
arXiv Detail & Related papers (2020-08-14T08:23:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.