Towards Summarizing Code Snippets Using Pre-Trained Transformers
- URL: http://arxiv.org/abs/2402.00519v1
- Date: Thu, 1 Feb 2024 11:39:19 GMT
- Title: Towards Summarizing Code Snippets Using Pre-Trained Transformers
- Authors: Antonio Mastropaolo, Matteo Ciniselli, Luca Pascarella, Rosalia
Tufano, Emad Aghajani, Gabriele Bavota
- Abstract summary: In this work, we take all the steps needed to train a DL model to document code snippets.
Our model identifies code summaries with 84% accuracy and is able to link them to the documented lines of code.
This unlocked the possibility of building a large-scale dataset of documented code snippets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When comprehending code, a helping hand may come from the natural language
comments documenting it that, unfortunately, are not always there. To support
developers in such a scenario, several techniques have been presented to
automatically generate natural language summaries for a given code. Most recent
approaches exploit deep learning (DL) to automatically document classes or
functions, while little effort has been devoted to more fine-grained
documentation (e.g., documenting code snippets or even a single statement).
Such a design choice is dictated by the availability of training data: For
example, in the case of Java, it is easy to create datasets composed of pairs
<Method, Javadoc> that can be fed to DL models to teach them how to summarize a
method. Such a comment-to-code linking is instead non-trivial when it comes to
inner comments documenting a few statements. In this work, we take all the
steps needed to train a DL model to document code snippets. First, we manually
built a dataset featuring 6.6k comments that have been (i) classified based on
their type (e.g., code summary, TODO), and (ii) linked to the code statements
they document. Second, we used such a dataset to train a multi-task DL model,
taking as input a comment and being able to (i) classify whether it represents
a "code summary" or not and (ii) link it to the code statements it documents.
Our model identifies code summaries with 84% accuracy and is able to link them
to the documented lines of code with recall and precision higher than 80%.
Third, we ran this model on 10k projects, identifying code summaries and
linking them to the documented code. This unlocked the possibility of building a
large-scale dataset of documented code snippets that have then been used to
train a new DL model able to document code snippets. A comparison with
state-of-the-art baselines shows the superiority of the proposed approach.
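To make the multi-task setup concrete, the sketch below shows one plausible way to serialize a labeled inner comment and its surrounding statements into text-to-text training pairs: one example for the classification task (code summary vs. other) and one for the linking task. The task prefixes, separators, and sample data are illustrative assumptions, not the paper's exact input encoding.

```python
# Illustrative sketch (assumed format): turning a labeled inner comment into
# text-to-text examples for a multi-task model. Task prefixes and the
# serialization scheme are hypothetical, not the paper's exact encoding.

def make_examples(comment, statements, is_summary, linked_line_ids):
    """Build (input, target) pairs for the two tasks described in the abstract.

    comment         -- the inner comment text
    statements      -- list of (line_id, code_line) tuples around the comment
    is_summary      -- True if the comment is a code summary (vs. TODO, etc.)
    linked_line_ids -- ids of the statements the comment documents
    """
    code_block = " <NL> ".join(f"{i}: {line}" for i, line in statements)

    # Task 1: classify whether the comment is a code summary.
    classify = (
        f"classify comment: {comment} context: {code_block}",
        "summary" if is_summary else "other",
    )

    # Task 2: link the summary to the statements it documents.
    link = (
        f"link comment: {comment} context: {code_block}",
        " ".join(str(i) for i in linked_line_ids) if linked_line_ids else "none",
    )
    return [classify, link]


if __name__ == "__main__":
    stmts = [
        (1, "List<String> lines = Files.readAllLines(path);"),
        (2, "lines.removeIf(String::isBlank);"),
        (3, "return lines;"),
    ]
    for inp, tgt in make_examples("// read the file and drop blank lines",
                                  stmts, True, [1, 2]):
        print(inp, "->", tgt)
```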
Related papers
- Building A Coding Assistant via the Retrieval-Augmented Language Model [24.654428111628242]
We propose a retrieval-augmented language model (CONAN) to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding.
It consists of a code structure-aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G).
arXiv Detail & Related papers (2024-10-21T17:34:39Z) - Context-aware Code Summary Generation [11.83787165247987]
Code summary generation is the task of writing natural language descriptions of a section of source code.
Recent advances in Large Language Models (LLMs) and other AI-based technologies have helped make automatic code summarization a reality.
We present an approach for including this context in recent LLM-based code summarization.
arXiv Detail & Related papers (2024-08-16T20:15:34Z) - Generation-Augmented Query Expansion For Code Retrieval [51.20943646688115]
We propose a generation-augmented query expansion framework, inspired by the human retrieval process of sketching an answer before searching.
We achieve new state-of-the-art results on the CodeSearchNet benchmark.
arXiv Detail & Related papers (2022-12-20T23:49:37Z) - CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z) - DocCoder: Generating Code by Retrieving and Reading Docs [87.88474546826913]
We introduce DocCoder, an approach that explicitly leverages code manuals and documentation.
Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model.
arXiv Detail & Related papers (2022-07-13T06:47:51Z) - CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation [46.45445767488915]
We show how to leverage an unlabelled code corpus to train a model for library-oriented code generation.
We craft two benchmarks named PandasEval and NumpyEval to evaluate library-oriented code generation.
arXiv Detail & Related papers (2022-06-14T14:44:34Z) - InCoder: A Generative Model for Code Infilling and Synthesis [88.46061996766348]
We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) and editing (via infilling).
InCoder is trained to generate code files from a large corpus of permissively licensed code.
Our model is the first generative model that is able to directly perform zero-shot code infilling.
arXiv Detail & Related papers (2022-04-12T16:25:26Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach on the code completion task in Python and Java, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
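As a rough illustration of the bimodal contrastive objective mentioned for CodeRetriever, the sketch below computes a generic InfoNCE-style loss over a batch of documentation and code embeddings, treating each matching text-code pair as the positive and all other in-batch pairs as negatives. The temperature, dimensions, and toy data are placeholders; this is a standard formulation, not necessarily the paper's exact loss.

```python
import numpy as np

def info_nce_loss(text_emb, code_emb, temperature=0.05):
    """Generic InfoNCE-style bimodal contrastive loss.

    text_emb, code_emb -- arrays of shape (batch, dim); row i of each
    corresponds to a matching (documentation, code) pair.
    """
    # L2-normalize so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)

    logits = (t @ c.T) / temperature             # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_probs)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(8, 64))
    code = text + 0.1 * rng.normal(size=(8, 64))  # toy: code embedding near its doc
    print(info_nce_loss(text, code))
```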