How to get better embeddings with code pre-trained models? An empirical
study
- URL: http://arxiv.org/abs/2311.08066v1
- Date: Tue, 14 Nov 2023 10:44:21 GMT
- Title: How to get better embeddings with code pre-trained models? An empirical
study
- Authors: Yu Zhao and Lina Gong and Haoxiang Zhang and Yaoshen Yu and Zhiqiu
Huang
- Abstract summary: We study five different code pre-trained models (PTMs) to generate embeddings for downstream classification tasks.
We find that embeddings obtained through special tokens do not sufficiently aggregate the semantic information of the entire code snippet.
The quality of code embeddings obtained by combining code data and text data in the same way as when pre-training the PTMs is poor and cannot guarantee richer semantic information.
- Score: 6.220333404184779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models have demonstrated powerful capabilities in the
field of natural language processing (NLP). Recently, code pre-trained models
(PTMs), which draw on experience from the NLP field, have also achieved
state-of-the-art results in many software engineering (SE) downstream tasks.
These code PTMs take into account the differences between programming languages
and natural languages during pre-training and make adjustments to pre-training
tasks and input data. However, researchers in the SE community still inherit
habits from the NLP field when using these code PTMs to generate embeddings for
SE downstream classification tasks, such as generating semantic embeddings for
code snippets through special tokens and inputting code and text information in
the same way as when pre-training the PTMs. In this paper, we empirically study five
different PTMs (i.e. CodeBERT, CodeT5, PLBART, CodeGPT and CodeGen) with three
different architectures (i.e. encoder-only, decoder-only and encoder-decoder)
on four SE downstream classification tasks (i.e. code vulnerability detection,
code clone detection, just-in-time defect prediction and function docstring
mismatch detection) with respect to the two aforementioned aspects. Our
experimental results indicate that (1) regardless of the architecture of the
code PTMs used, embeddings obtained through special tokens do not sufficiently
aggregate the semantic information of the entire code snippet; (2) the quality
of code embeddings obtained by combining code data and text data in the same way
as when pre-training the PTMs is poor and cannot guarantee richer semantic
information; (3) when using the method that aggregates the vector representations
of all code tokens, the decoder-only PTMs can obtain code embeddings whose
semantics are as rich as, or even richer than, those obtained from the
encoder-only and encoder-decoder PTMs.
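The comparison above hinges on how a fixed-size embedding is extracted from a PTM: taking the hidden state of a special token versus aggregating the hidden states of all code tokens. The sketch below illustrates both strategies for an encoder-only model. It is a minimal illustration assuming the Hugging Face transformers library and the public microsoft/codebert-base checkpoint, not the authors' exact experimental pipeline.

```python
# Minimal sketch (not the authors' exact pipeline): two ways to turn a code
# snippet into a fixed-size embedding with an encoder-only code PTM.
# Assumes the Hugging Face `transformers` library and the public
# `microsoft/codebert-base` checkpoint; both choices are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)

# Strategy 1: use the special-token vector (<s>, the RoBERTa-style [CLS]) as the
# snippet embedding -- the habit inherited from NLP that the paper questions.
special_token_embedding = hidden[:, 0, :]

# Strategy 2: aggregate the vectors of all code tokens (mean pooling over
# non-padding positions) -- the aggregation method the paper finds preserves
# richer snippet-level semantics.
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
mean_pooled_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(special_token_embedding.shape, mean_pooled_embedding.shape)
```

Either vector can then be fed to a downstream classifier (e.g., for clone detection or just-in-time defect prediction); per finding (1), the special-token vector alone does not sufficiently aggregate the semantics of the whole snippet, so the pooled variant is the more informative feature.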
Related papers
- Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection? [30.84647604639891]
We investigate the effects of code embedding generated by ten different code PTMs on the performance of vulnerability detection.
We propose Coding-PTMs, a recommendation framework to assist engineers in selecting optimal code PTMs for their specific vulnerability detection tasks.
arXiv Detail & Related papers (2024-08-09T04:56:26Z) - ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization [21.886950861445122]
Code summarization aims to automatically generate succinct natural language summaries for given code snippets.
This paper proposes a novel approach to improve code summarization based on summary-focused tasks.
arXiv Detail & Related papers (2024-07-01T03:06:51Z) - Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - On Leveraging Encoder-only Pre-trained Language Models for Effective
Keyphrase Generation [76.52997424694767]
This study addresses the application of encoder-only Pre-trained Language Models (PLMs) in keyphrase generation (KPG)
With encoder-only PLMs, although keyphrase extraction (KPE) with Conditional Random Fields slightly excels at identifying present keyphrases, the KPG formulation yields a broader spectrum of keyphrase predictions.
We also identify a favorable parameter allocation towards model depth rather than width when employing encoder-decoder architectures with encoder-only PLMs.
arXiv Detail & Related papers (2024-02-21T18:57:54Z) - CodeT5+: Open Code Large Language Models for Code Understanding and
Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z) - Soft-Labeled Contrastive Pre-training for Function-level Code
Representation [127.71430696347174]
We present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods.
By considering the relevance between code snippets in a large-scale code corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
arXiv Detail & Related papers (2022-10-18T05:17:37Z) - An Exploratory Study on Code Attention in BERT [8.488193857572211]
We investigate the attention behavior of PLM on code and compare it with natural language.
We show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the tokens most attended to in NLP.
The findings can benefit the research community by using code-specific representations instead of applying the common embeddings used in NLP.
arXiv Detail & Related papers (2022-04-05T21:23:10Z) - CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for
Code Understanding and Generation [36.47905744758698]
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers.
Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning.
arXiv Detail & Related papers (2021-09-02T12:21:06Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.