Fugu-MT Paper Translation (Abstract): Towards Summarizing Code Snippets Using Pre-Trained Transformers

Paper overview: Towards Summarizing Code Snippets Using Pre-Trained Transformers

arxiv url: http://arxiv.org/abs/2402.00519v1
Date: Thu, 1 Feb 2024 11:39:19 GMT
Status: Translation complete
System update date: 2024-02-02 15:39:39.400723
Title: Towards Summarizing Code Snippets Using Pre-Trained Transformers
Authors: Antonio Mastropaolo, Matteo Ciniselli, Luca Pascarella, Rosalia Tufano, Emad Aghajani, Gabriele Bavota
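Abstract summary: In this work, we take all the steps needed to train a DL model to document code snippets. Our model identifies code summaries with 84% accuracy and can link them to the documented lines of code. This made it possible to build a large-scale dataset of documented code snippets.
Reference score (independently computed attention score): 20.982048349530483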
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: When comprehending code, a helping hand may come from the natural language comments documenting it that, unfortunately, are not always there. To support developers in such a scenario, several techniques have been presented to automatically generate natural language summaries for a given piece of code. Most recent approaches exploit deep learning (DL) to automatically document classes or functions, while little effort has been devoted to more fine-grained documentation (e.g., documenting code snippets or even a single statement). Such a design choice is dictated by the availability of training data: for example, in the case of Java, it is easy to create datasets composed of pairs <Method, Javadoc> that can be fed to DL models to teach them how to summarize a method. Such comment-to-code linking is instead non-trivial when it comes to inner comments documenting a few statements. In this work, we take all the steps needed to train a DL model to document code snippets. First, we manually built a dataset featuring 6.6k comments that have been (i) classified based on their type (e.g., code summary, TODO), and (ii) linked to the code statements they document. Second, we used this dataset to train a multi-task DL model, taking as input a comment and being able to (i) classify whether it represents a "code summary" or not and (ii) link it to the code statements it documents. Our model identifies code summaries with 84% accuracy and is able to link them to the documented lines of code with recall and precision higher than 80%. Third, we ran this model on 10k projects, identifying and linking code summaries to the documented code. This unlocked the possibility of building a large-scale dataset of documented code snippets, which was then used to train a new DL model able to document code snippets. A comparison with state-of-the-art baselines shows the superiority of the proposed approach.
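As a rough illustration of the linking evaluation mentioned in the abstract, the sketch below computes line-level precision and recall for one comment by comparing predicted and annotated line numbers. The abstract does not state how the authors define these metrics, so the set-overlap definition, the `LinkedComment` record, the `linking_precision_recall` function, and the toy data are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LinkedComment:
    """Hypothetical record: one inner comment and the code lines it documents."""
    text: str
    is_code_summary: bool
    linked_lines: frozenset[int]


def linking_precision_recall(
    predicted: frozenset[int], expected: frozenset[int]
) -> tuple[float, float]:
    """Line-level precision and recall for one comment (assumed metric definition)."""
    true_positives = len(predicted & expected)
    # Guard against empty sets so the division is always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Toy data invented for illustration; not taken from the paper's dataset.
annotated = LinkedComment("// sort users by age", True, frozenset({12, 13, 14}))
predicted = LinkedComment("// sort users by age", True, frozenset({12, 13, 15}))

precision, recall = linking_precision_recall(
    predicted.linked_lines, annotated.linked_lines
)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.67 recall=0.67"
```

Here two of the three predicted lines match the annotation, so both metrics come out at 2/3. Averaging such per-comment scores over a test set is one plausible way to obtain aggregate figures, though the paper may aggregate differently.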
