Fugu-MT Paper Translation (Abstract): Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus

Paper summary: Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus

arxiv url: http://arxiv.org/abs/2201.11313v1
Date: Thu, 27 Jan 2022 04:15:59 GMT
Status: Translation complete
Last updated in system: 2022-01-28 15:01:00.265220
Title: Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus
Authors: Chen Wu and Ming Yan
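Abstract summary: We propose a novel deep semantic model that makes use of multi-modal sources. We apply the proposed model to the CodeSearchNet challenge on semantic code search. Our model is trained on the CodeSearchNet corpus and evaluated on held-out data; the final model achieves 0.384 NDCG and won first place in this benchmark.
Reference score (attention score computed independently by this site): 17.6095840480926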
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Semantic code search is the task of retrieving relevant code snippets given a natural language query. Unlike typical information retrieval tasks, code search requires bridging the semantic gap between programming languages and natural language in order to better describe intrinsic concepts and semantics. Recently, deep neural networks for code search have become a hot research topic. Typical methods for neural code search first represent the code snippet and the query text as separate embeddings, and then use a vector distance (e.g. dot product or cosine) to calculate the semantic similarity between them. There are many ways to aggregate a variable-length sequence of code or query tokens into a learnable embedding, including bi-encoders, cross-encoders, and poly-encoders. The goal of the query encoder and the code encoder is to produce embeddings that are close to each other for a query and its corresponding desired code snippet, so the choice and design of the encoder are very significant. In this paper, we propose a novel deep semantic model that makes use of not only multi-modal sources but also feature extractors such as self-attention, aggregated vectors, and combinations of intermediate representations. We apply the proposed model to the CodeSearchNet challenge on semantic code search. We align cross-lingual embeddings for multi-modality learning with large batches and hard example mining, and combine different learned representations to further enhance representation learning. Our model is trained on the CodeSearchNet corpus and evaluated on held-out data; the final model achieves 0.384 NDCG and won first place in this benchmark. Models and code are available at https://github.com/overwindows/SemanticCodeSearch.git.
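The retrieval scheme the abstract describes (separate query and code embeddings compared by cosine, with NDCG used to score the resulting ranking) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' model: the toy embeddings, the relevance labels, and the function names `cosine`, `rank_snippets`, `dcg`, and `ndcg` are all assumptions, and the exponential-gain form of DCG used here is one common variant that may differ from the one used by the CodeSearchNet evaluation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_snippets(query_vec, code_vecs):
    """Return snippet indices sorted by descending similarity to the query."""
    scores = [cosine(query_vec, c) for c in code_vecs]
    return sorted(range(len(code_vecs)), key=lambda i: scores[i], reverse=True)


def dcg(relevances):
    """Discounted cumulative gain (exponential-gain variant, an assumption)."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))


def ndcg(ranking, relevance):
    """NDCG of a ranking, given a graded relevance label per snippet index."""
    gains = [relevance[i] for i in ranking]
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical toy data: in a real system these vectors would come from
# the trained query encoder and code encoder.
query = [1.0, 0.0]
snippets = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
relevance = [0, 2, 1]  # made-up graded labels, one per snippet

ranking = rank_snippets(query, snippets)
print(ranking, ndcg(ranking, relevance))
```

Because the two encoders are independent in this setup, code embeddings can be computed once and reused across queries; only the query embedding and the similarity scores need to be computed at search time.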
